Skip to content

cohere

mindnlp.transformers.models.cohere.configuration_cohere

Cohere model configuration

mindnlp.transformers.models.cohere.configuration_cohere.CohereConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CohereModel]. It is used to instantiate an Cohere model according to the specified arguments, defining the model architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information. Instantiating a configuration with the defaults will yield a similar configuration to that of the CohereForAI/c4ai-command-r-v01 model.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [CohereModel]

TYPE: `int`, *optional*, defaults to 256000 DEFAULT: 256000

hidden_size

Dimension of the hidden representations.

TYPE: `int`, *optional*, defaults to 8192 DEFAULT: 8192

intermediate_size

Dimension of the MLP representations.

TYPE: `int`, *optional*, defaults to 22528 DEFAULT: 22528

logit_scale

The scaling factor for the output logits.

TYPE: `float`, *optional*, defaults to 0.0625 DEFAULT: 0.0625

num_hidden_layers

Number of hidden layers in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 40 DEFAULT: 40

num_attention_heads

Number of attention heads for each attention layer in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

num_key_value_heads

This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), if num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be forwarded by meanpooling all the original heads within that group. For more details checkout this paper. If it is not specified, will default to num_attention_heads.

TYPE: `int`, *optional* DEFAULT: None

hidden_act

The non-linear activation function (function or string) in the decoder.

TYPE: `str` or `function`, *optional*, defaults to `"silu"` DEFAULT: 'silu'

max_position_embeddings

The maximum sequence length that this model might ever be used with.

TYPE: `int`, *optional*, defaults to 8192 DEFAULT: 8192

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

pad_token_id

Padding token id.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

bos_token_id

Beginning of stream token id.

TYPE: `int`, *optional*, defaults to 5 DEFAULT: 5

eos_token_id

End of stream token id.

TYPE: `int`, *optional*, defaults to 255001 DEFAULT: 255001

tie_word_embeddings

Whether to tie weight embeddings

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

rope_theta

The base period of the RoPE embeddings.

TYPE: `float`, *optional*, defaults to 10000.0 DEFAULT: 10000.0

attention_bias

Whether to use a bias in the query, key, value and output projection layers during self-attention.

TYPE: `bool`, defaults to `False`, *optional*, defaults to `False` DEFAULT: False

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

use_qk_norm

Whether to use query-key normalization in the attention

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

Example
>>> from transformers import CohereModel, CohereConfig
...
>>> # Initializing a Cohere model configuration
>>> configuration = CohereConfig()
...
>>> # Initializing a model from the Cohere configuration
>>> model = CohereModel(configuration) # doctest: +SKIP
...
>>> # Accessing the model configuration
>>> configuration = model.config # doctest: +SKIP
Source code in mindnlp/transformers/models/cohere/configuration_cohere.py
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
class CohereConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CohereModel`]. It is used to instantiate an Cohere
    model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the
    [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) model.

    Args:
        vocab_size (`int`, *optional*, defaults to 256000):
            Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`CohereModel`]
        hidden_size (`int`, *optional*, defaults to 8192):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 22528):
            Dimension of the MLP representations.
        logit_scale (`float`, *optional*, defaults to 0.0625):
            The scaling factor for the output logits.
        num_hidden_layers (`int`, *optional*, defaults to 40):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 64):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be forwarded
            by meanpooling all the original heads within that group. For more details checkout [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 8192):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*, defaults to 0):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 5):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 255001):
            End of stream token id.
        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        use_qk_norm (`bool`, *optional*, defaults to `False`):
            Whether to use query-key normalization in the attention

    Example:
        ```python
        >>> from transformers import CohereModel, CohereConfig
        ...
        >>> # Initializing a Cohere model configuration
        >>> configuration = CohereConfig()
        ...
        >>> # Initializing a model from the Cohere configuration
        >>> model = CohereModel(configuration) # doctest: +SKIP
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config # doctest: +SKIP
        ```
        """

    model_type = "cohere"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=256000,
        hidden_size=8192,
        intermediate_size=22528,
        logit_scale=0.0625,
        num_hidden_layers=40,
        num_attention_heads=64,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=8192,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        use_cache=True,
        pad_token_id=0,
        bos_token_id=5,
        eos_token_id=255001,
        tie_word_embeddings=True,
        rope_theta=10000.0,
        attention_bias=False,
        attention_dropout=0.0,
        use_qk_norm=False,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.logit_scale = logit_scale
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.use_qk_norm = use_qk_norm

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

mindnlp.transformers.models.cohere.modeling_cohere

MindSpore Cohere model.

mindnlp.transformers.models.cohere.modeling_cohere.CohereAttention

Bases: Module

Multi-headed attention from 'Attention Is All You Need' paper

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
class CohereAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: CohereConfig, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        if layer_idx is None:
            logger.warning_once(
                f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
                "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
                "when creating this class."
            )

        self.attention_dropout = config.attention_dropout
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        self.is_causal = True
        self.use_qk_norm = config.use_qk_norm

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )

        if self.use_qk_norm:
            # When sharding the model using Tensor Parallelism, need to be careful to use n_local_heads
            self.q_norm = CohereLayerNorm(hidden_size=(self.num_heads, self.head_dim), eps=config.layer_norm_eps)
            self.k_norm = CohereLayerNorm(
                hidden_size=(self.num_key_value_heads, self.head_dim), eps=config.layer_norm_eps
            )

        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
        self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=config.attention_bias)
        self._init_rope()

    # Ignore copy
    def _init_rope(self):
        self.rotary_emb = CohereRotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )

    # Ignore copy
    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        cache_position: Optional[mindspore.Tensor] = None,
        **kwargs,
    ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
        bsz, q_len, _ = hidden_states.shape

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
        if self.use_qk_norm:
            query_states = self.q_norm(query_states)
            key_states = self.k_norm(key_states)

        query_states = query_states.swapaxes(1, 2)
        key_states = key_states.swapaxes(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)

        cos, sin = self.rotary_emb(value_states, position_ids)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        if past_key_value is not None:
            # sin and cos are specific to RoPE models; position_ids needed for the static cache
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = ops.matmul(query_states, key_states.swapaxes(2, 3)) / math.sqrt(self.head_dim)

        if attention_mask is not None:  # no matter the length, we just slice it
            causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
            attn_weights = attn_weights + causal_mask

        # upcast attention to fp32
        attn_weights = ops.softmax(attn_weights, axis=-1, dtype=mindspore.float32).to(query_states.dtype)
        attn_weights = ops.dropout(attn_weights, p=self.attention_dropout, training=self.training)
        attn_output = ops.matmul(attn_weights, value_states)

        if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.shape}"
            )

        attn_output = attn_output.swapaxes(1, 2)

        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value

mindnlp.transformers.models.cohere.modeling_cohere.CohereDecoderLayer

Bases: Module

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
class CohereDecoderLayer(nn.Module):
    def __init__(self, config: CohereConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = COHERE_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)

        self.mlp = CohereMLP(config)
        self.input_layernorm = CohereLayerNorm(hidden_size=(config.hidden_size), eps=config.layer_norm_eps)

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]:
        """
        Args:
            hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`mindspore.Tensor`, *optional*):
                attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
                query_sequence_length, key_sequence_length)` if default attention is used.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
        """
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states_attention, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            cache_position=cache_position,
        )
        # Fully Connected
        hidden_states_mlp = self.mlp(hidden_states)

        # Add everything together
        hidden_states = residual + hidden_states_attention + hidden_states_mlp

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs

mindnlp.transformers.models.cohere.modeling_cohere.CohereDecoderLayer.forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None)

PARAMETER DESCRIPTION
hidden_states

input to the layer of shape (batch, seq_len, embed_dim)

TYPE: `mindspore.Tensor`

attention_mask

attention mask of size (batch_size, sequence_length) if flash attention is used or (batch_size, 1, query_sequence_length, key_sequence_length) if default attention is used.

TYPE: `mindspore.Tensor`, *optional* DEFAULT: None

output_attentions

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: False

use_cache

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

TYPE: `bool`, *optional* DEFAULT: False

past_key_value

cached past key and value projection states

TYPE: `Tuple(mindspore.Tensor)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
    cache_position: Optional[mindspore.Tensor] = None,
) -> Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]:
    """
    Args:
        hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`mindspore.Tensor`, *optional*):
            attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
            query_sequence_length, key_sequence_length)` if default attention is used.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
            (see `past_key_values`).
        past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
    """
    residual = hidden_states

    hidden_states = self.input_layernorm(hidden_states)

    # Self Attention
    hidden_states_attention, self_attn_weights, present_key_value = self.self_attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
        cache_position=cache_position,
    )
    # Fully Connected
    hidden_states_mlp = self.mlp(hidden_states)

    # Add everything together
    hidden_states = residual + hidden_states_attention + hidden_states_mlp

    outputs = (hidden_states,)

    if output_attentions:
        outputs += (self_attn_weights,)

    if use_cache:
        outputs += (present_key_value,)

    return outputs

mindnlp.transformers.models.cohere.modeling_cohere.CohereForCausalLM

Bases: CoherePreTrainedModel

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
class CohereForCausalLM(CoherePreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    # Ignore copy
    def __init__(self, config):
        super().__init__(config)
        self.model = CohereModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.logit_scale = config.logit_scale
        self.tie_word_embeddings = config.tie_word_embeddings
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:
            `Union[Tuple, CausalLMOutputWithPast]`

        Example:
            ```python
            >>> from transformers import AutoTokenizer, CohereForCausalLM
            ...
            >>> model = CohereForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01")
            >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
            ...
            >>> prompt = "Hey, are you conscious? Can you talk to me?"
            >>> inputs = tokenizer(prompt, return_tensors="pt")
            ...
            >>> # Generate
            >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
            >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
            "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits * self.logit_scale
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
            # Flatten the tokens
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            loss = ops.cross_entropy(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        attention_mask=None,
        inputs_embeds=None,
        cache_position=None,
        use_cache=None,
        **kwargs,
    ):
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        past_length = 0
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                past_length = cache_position[0] if cache_position is not None else past_key_values.get_seq_length()
                max_cache_length = (
                    mindspore.tensor(past_key_values.get_max_length())
                    if past_key_values.get_max_length() is not None
                    else None
                )
                cache_length = past_length if max_cache_length is None else ops.min(max_cache_length, past_length)
            # TODO joao: remove this `else` after `generate` prioritizes `Cache` objects
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]
                max_cache_length = None

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
            if (
                max_cache_length is not None
                and attention_mask is not None
                and cache_length + input_ids.shape[1] > max_cache_length
            ):
                attention_mask = attention_mask[:, -max_cache_length:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
            # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
            # TODO: use `next_tokens` directly instead.
            model_inputs = {"input_ids": input_ids}

        input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
        if cache_position is None:
            cache_position = ops.arange(past_length, past_length + input_length)
        elif use_cache:
            cache_position = cache_position[-input_length:]

        model_inputs.update(
            {
                "position_ids": position_ids,
                "cache_position": cache_position,
                "past_key_values": past_key_values,
                "use_cache": use_cache,
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),
            )
        return reordered_past

mindnlp.transformers.models.cohere.modeling_cohere.CohereForCausalLM.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, cache_position=None)

PARAMETER DESCRIPTION
labels

Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, CausalLMOutputWithPast]

Union[Tuple, CausalLMOutputWithPast]

Example
>>> from transformers import AutoTokenizer, CohereForCausalLM
...
>>> model = CohereForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01")
>>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
...
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
...
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[mindspore.Tensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

    Returns:
        `Union[Tuple, CausalLMOutputWithPast]`

    Example:
        ```python
        >>> from transformers import AutoTokenizer, CohereForCausalLM
        ...
        >>> model = CohereForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01")
        >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
        ...
        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")
        ...
        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        cache_position=cache_position,
    )

    hidden_states = outputs[0]
    logits = self.lm_head(hidden_states)
    logits = logits * self.logit_scale
    logits = logits.float()

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        # Flatten the tokens
        shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        loss = ops.cross_entropy(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.cohere.modeling_cohere.CohereLayerNorm

Bases: Module

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
class CohereLayerNorm(nn.Module):
    def __init__(self, hidden_size=None, eps=1e-5, bias=False):
        """The hidden size can be a tuple or an int. The tuple is used for QKNorm to normalize across head_dim"""
        super().__init__()
        self.weight = Parameter(ops.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(mindspore.float32)
        mean = hidden_states.mean(-1, keep_dims=True)
        variance = (hidden_states - mean).pow(2).mean(-1, keep_dims=True)
        hidden_states = (hidden_states - mean) * ops.rsqrt(variance + self.variance_epsilon)
        hidden_states = self.weight.to(mindspore.float32) * hidden_states
        return hidden_states.to(input_dtype)

mindnlp.transformers.models.cohere.modeling_cohere.CohereLayerNorm.__init__(hidden_size=None, eps=1e-05, bias=False)

The hidden size can be a tuple or an int. The tuple is used for QKNorm to normalize across head_dim

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
64
65
66
67
68
def __init__(self, hidden_size=None, eps=1e-5, bias=False):
    """The hidden size can be a tuple or an int. The tuple is used for QKNorm to normalize across head_dim"""
    super().__init__()
    self.weight = Parameter(ops.ones(hidden_size))
    self.variance_epsilon = eps

mindnlp.transformers.models.cohere.modeling_cohere.CohereModel

Bases: CoherePreTrainedModel

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a [CohereDecoderLayer]

PARAMETER DESCRIPTION
config

CohereConfig

TYPE: CohereConfig

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
class CohereModel(CoherePreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`CohereDecoderLayer`]

    Args:
        config: CohereConfig
    """

    # Ignore copy
    def __init__(self, config: CohereConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [CohereDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = CohereLayerNorm(hidden_size=(config.hidden_size), eps=config.layer_norm_eps)
        self.gradient_checkpointing = False

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
            )

        if self.gradient_checkpointing and self.training and use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
            )
            use_cache = False

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)

        past_seen_tokens = 0
        return_legacy_cache = False
        if use_cache and not isinstance(past_key_values, Cache):  # kept for BC (non `Cache` `past_key_values` inputs)
            return_legacy_cache = True
            past_key_values = DynamicCache.from_legacy_cache(past_key_values)

        if cache_position is None:
            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
            cache_position = ops.arange(
                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1]
            )

        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)

        causal_mask = self._update_causal_mask(
            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
        )

        # embed positions
        hidden_states = inputs_embeds

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = None

        for decoder_layer in self.layers:
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    causal_mask,
                    position_ids,
                    past_key_values,
                    output_attentions,
                    use_cache,
                    cache_position,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=causal_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                    cache_position=cache_position,
                )

            hidden_states = layer_outputs[0]
            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        if return_legacy_cache:
            next_cache = next_cache.to_legacy_cache()

        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )

    def _update_causal_mask(
        self,
        attention_mask: mindspore.Tensor,
        input_tensor: mindspore.Tensor,
        cache_position: mindspore.Tensor,
        past_key_values: Cache,
        output_attentions: bool,
    ):
        # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
        # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
        # to infer the attention mask.
        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
        using_static_cache = isinstance(past_key_values, StaticCache)

        dtype = input_tensor.dtype
        min_dtype = finfo(dtype, 'min')
        sequence_length = input_tensor.shape[1]
        if using_static_cache:
            target_length = past_key_values.get_max_length()
        else:
            target_length = (
                attention_mask.shape[-1]
                if isinstance(attention_mask, mindspore.Tensor)
                else past_seen_tokens + sequence_length + 1
            )

        if attention_mask is not None and attention_mask.dim() == 4:
            # in this case we assume that the mask comes already in inverted form and requires no inversion or slicing
            if attention_mask.max() != 0:
                raise ValueError("Custom 4D attention mask should be passed in inverted form with max==0`")
            causal_mask = attention_mask
        else:
            causal_mask = ops.full(
                (sequence_length, target_length), fill_value=min_dtype, dtype=dtype
            )
            if sequence_length != 1:
                causal_mask = ops.triu(causal_mask, diagonal=1)
            causal_mask *= ops.arange(target_length) > cache_position.reshape(-1, 1)

            causal_mask = causal_mask[None, None, :, :].expand(input_tensor.shape[0], 1, -1, -1)
            if attention_mask is not None:
                causal_mask = causal_mask.copy()  # copy to contiguous memory for in-place edit
                mask_length = attention_mask.shape[-1]
                padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
                padding_mask = padding_mask == 0
                causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
                    padding_mask, min_dtype
                )
        if (
            attention_mask is not None
            and not output_attentions
        ):
            # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
            # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
            # Details: https://github.com/pytorch/pytorch/issues/110213
            causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
        return causal_mask

mindnlp.transformers.models.cohere.modeling_cohere.CoherePreTrainedModel

Bases: PreTrainedModel

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
class CoherePreTrainedModel(PreTrainedModel):
    config_class = CohereConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["CohereDecoderLayer"]
    _skip_keys_device_placement = ["past_key_values"]
    _supports_flash_attn_2 = True
    _supports_sdpa = True
    _supports_cache_class = True
    _supports_static_cache = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = initializer(Normal(self.config.initializer_range),
                                                 cell.weight.shape,
                                                 cell.weight.dtype)
            if cell.padding_idx is not None:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(weight)

mindnlp.transformers.models.cohere.modeling_cohere.apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1)

Applies Rotary Position Embedding to the query and key tensors.

PARAMETER DESCRIPTION
q

The query tensor.

TYPE: `mindspore.Tensor`

k

The key tensor.

TYPE: `mindspore.Tensor`

cos

The cosine part of the rotary embedding.

TYPE: `mindspore.Tensor`

sin

The sine part of the rotary embedding.

TYPE: `mindspore.Tensor`

position_ids

Deprecated and unused.

TYPE: `mindspore.Tensor`, *optional* DEFAULT: None

unsqueeze_dim

The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

RETURNS DESCRIPTION

tuple(mindspore.Tensor) comprising of the query and key tensors rotated using the Rotary Position Embedding.

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """
    Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`mindspore.Tensor`): The query tensor.
        k (`mindspore.Tensor`): The key tensor.
        cos (`mindspore.Tensor`): The cosine part of the rotary embedding.
        sin (`mindspore.Tensor`): The sine part of the rotary embedding.
        position_ids (`mindspore.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

    Returns:
        `tuple(mindspore.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    dtype = q.dtype
    q = q.float()
    k = k.float()
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed.to(dtype=dtype), k_embed.to(dtype=dtype)

mindnlp.transformers.models.cohere.modeling_cohere.repeat_kv(hidden_states, n_rep)

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)

Source code in mindnlp/transformers/models/cohere/modeling_cohere.py
163
164
165
166
167
168
169
170
171
172
def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

mindnlp.transformers.models.cohere.tokenization_cohere_fast

cohere tokenization

mindnlp.transformers.models.cohere.tokenization_cohere_fast.CohereTokenizerFast

Bases: PreTrainedTokenizerFast

Construct a Cohere tokenizer. Based on byte-level Byte-Pair-Encoding.

This uses notably ByteFallback and NFC normalization.

Example
>>> from transformers import AutoTokenizer
...
>>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
>>> tokenizer.encode("Hello this is a test")
[5, 28339, 2075, 1801, 1671, 3282]

If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, checkout [post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation.

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer, but since the model was not pretrained this way, it might yield a decrease in performance.

When used with is_split_into_words=True, this tokenizer needs to be instantiated with add_prefix_space=True.

This tokenizer inherits from [PreTrainedTokenizerFast] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER DESCRIPTION
vocab_file

Path to the vocabulary file.

TYPE: `str`, *optional* DEFAULT: None

merges_file

Path to the merges file.

TYPE: `str`, *optional* DEFAULT: None

tokenizer_file

tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer.

TYPE: `str`, *optional* DEFAULT: None

clean_up_tokenization_spaces

Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

unk_token

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

TYPE: `str` or `tokenizers.AddedToken`, *optional*, defaults to `"<UNK>"` DEFAULT: '<UNK>'

bos_token

The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

TYPE: `str` or `tokenizers.AddedToken`, *optional*, defaults to `"<BOS_TOKEN>"` DEFAULT: '<BOS_TOKEN>'

eos_token

The end of sequence token.

TYPE: `str` or `tokenizers.AddedToken`, *optional*, defaults to `"<|END_OF_TURN_TOKEN|>"` DEFAULT: '<|END_OF_TURN_TOKEN|>'

add_bos_token

Whether or not to add an bos_token at the start of sequences.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

add_eos_token

Whether or not to add an eos_token at the end of sequences.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

use_default_system_prompt

Whether or not the default system prompt for Cohere tokenizer should be used.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

add_prefix_space

Whether or not the tokenizer should automatically add a prefix space

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

Source code in mindnlp/transformers/models/cohere/tokenization_cohere_fast.py
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
class CohereTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Cohere tokenizer. Based on byte-level Byte-Pair-Encoding.

    This uses notably ByteFallback and NFC normalization.

    Example:
        ```python
        >>> from transformers import AutoTokenizer
        ...
        >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
        >>> tokenizer.encode("Hello this is a test")
        [5, 28339, 2075, 1801, 1671, 3282]
        ```

    If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or
    call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
    values of the first token and final token of an encoded sequence will not be correct). For more details, checkout
    [post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation.

    You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
    the model was not pretrained this way, it might yield a decrease in performance.

    <Tip>

    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.

    </Tip>

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`, *optional*):
            Path to the vocabulary file.
        merges_file (`str`, *optional*):
            Path to the merges file.
        tokenizer_file (`str`, *optional*):
            [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
            contains everything needed to load the tokenizer.
        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
            Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
            extra spaces.
        unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<UNK>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<BOS_TOKEN>"`):
            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
        eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<|END_OF_TURN_TOKEN|>"`):
            The end of sequence token.
        add_bos_token (`bool`, *optional*, defaults to `True`):
            Whether or not to add an `bos_token` at the start of sequences.
        add_eos_token (`bool`, *optional*, defaults to `False`):
            Whether or not to add an `eos_token` at the end of sequences.
        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
            Whether or not the default system prompt for Cohere tokenizer should be used.
        add_prefix_space (`bool`, *optional*, defaults to `False`):
            Whether or not the tokenizer should automatically add a prefix space
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    padding_side = "left"
    model_input_names = ["input_ids", "attention_mask"]
    slow_tokenizer_class = None
    # No `max_model_input_sizes`

    def __init__(
        self,
        vocab_file=None,
        merges_file=None,
        tokenizer_file=None,
        clean_up_tokenization_spaces=False,
        unk_token="<UNK>",
        bos_token="<BOS_TOKEN>",
        eos_token="<|END_OF_TURN_TOKEN|>",
        add_bos_token=True,
        add_eos_token=False,
        use_default_system_prompt=False,
        add_prefix_space=False,
        **kwargs,
    ):
        super().__init__(
            vocab_file=vocab_file,
            merges_file=merges_file,
            tokenizer_file=tokenizer_file,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            use_default_system_prompt=use_default_system_prompt,
            add_prefix_space=add_prefix_space,
            **kwargs,
        )
        self._add_bos_token = add_bos_token
        self._add_eos_token = add_eos_token
        self.update_post_processor()
        self.use_default_system_prompt = use_default_system_prompt
        self.vocab_file = vocab_file
        self.grounded_generation_template = kwargs.pop("grounded_generation_template", None)
        self.tool_use_template = kwargs.pop("tool_use_template", None)

        # TODO @ArthurZucker this can only work one way for now, to update later-on. Tests should also properly
        # check this as they were green before.
        pre_tok_state = pickle.dumps(self.backend_tokenizer.pre_tokenizer)
        decoder_state = pickle.dumps(self.backend_tokenizer.decoder)

        if add_prefix_space:
            pre_tok_state = pre_tok_state.replace(b'"add_prefix_space":false', b'"add_prefix_space": true')
            decoder_state = decoder_state.replace(b'"add_prefix_space":false', b'"add_prefix_space": true')
        self.backend_tokenizer.pre_tokenizer = pickle.loads(pre_tok_state)
        self.backend_tokenizer.decoder = pickle.loads(decoder_state)

        self.add_prefix_space = add_prefix_space

    def _batch_encode_plus(self, *args, **kwargs) -> BatchEncoding:
        is_split_into_words = kwargs.get("is_split_into_words", False)
        if not (self.add_prefix_space or not is_split_into_words):
            raise Exception(
                f"You need to instantiate {self.__class__.__name__} with add_prefix_space=True to use it with"
                " pretokenized inputs."
            )

        return super()._batch_encode_plus(*args, **kwargs)

    def _encode_plus(self, *args, **kwargs) -> BatchEncoding:
        is_split_into_words = kwargs.get("is_split_into_words", False)

        if not (self.add_prefix_space or not is_split_into_words):
            raise Exception(
                f"You need to instantiate {self.__class__.__name__} with add_prefix_space=True to use it with"
                " pretokenized inputs."
            )

        return super()._encode_plus(*args, **kwargs)

    def update_post_processor(self):
        """
        Updates the underlying post processor with the current `bos_token` and `eos_token`.
        """
        bos = self.bos_token
        bos_token_id = self.bos_token_id
        if bos is None and self.add_bos_token:
            raise ValueError("add_bos_token = True but bos_token = None")

        eos = self.eos_token
        eos_token_id = self.eos_token_id
        if eos is None and self.add_eos_token:
            raise ValueError("add_eos_token = True but eos_token = None")

        single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
        pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"

        special_tokens = []
        if self.add_bos_token:
            special_tokens.append((bos, bos_token_id))
        if self.add_eos_token:
            special_tokens.append((eos, eos_token_id))
        self._tokenizer.post_processor = processors.TemplateProcessing(
            single=single, pair=pair, special_tokens=special_tokens
        )

    @property
    def add_eos_token(self):
        return self._add_eos_token

    @property
    def add_bos_token(self):
        return self._add_bos_token

    @add_eos_token.setter
    def add_eos_token(self, value):
        self._add_eos_token = value
        self.update_post_processor()

    @add_bos_token.setter
    def add_bos_token(self, value):
        self._add_bos_token = value
        self.update_post_processor()

    @property
    def default_chat_template(self):
        """
        Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
        Additioanlly, to indicate the source of the message, <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
        for user, assitant and system messages respectively.

        The output should look something like:

        ```<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|>
        <|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
        {{ I am doing well! }}<|END_OF_TURN_TOKEN|>```

        Use add_generation_prompt to add a prompt for the model to generate a response:

        Example:
            ```python
            >>> from transformers import AutoTokenizer
            >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
            >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
            >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'
            ```
        """
        default_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% if system_message != false %}"  # Start with system message
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_template = default_template.replace(
            "USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false"
        )
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        default_template = default_template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        tool_use_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
            "{% for tool in tools %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n\n'}}"
            "{% endif %}"
            "{{'```python\ndef ' + tool.name + '('}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ ', '}}"
            "{% endif %}"
            "{{param_name}}: "
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + '] = None'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{% endfor %}"
            '{{ \') -> List[Dict]:\n    """\'}}'
            "{{ tool.description }}"
            "{% if tool.parameter_definitions|length != 0 %}"
            "{{ '\n\n    Args:\n        '}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n        ' }}"
            "{% endif %}"
            "{{ param_name + ' ('}}"
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + ']'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{{ '): ' + param_fields.description }}"
            "{% endfor %}"
            "{% endif %}"
            '{{ \'\n    """\n    pass\n```\' }}'
            "{% endfor %}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_tool_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        tool_use_template = tool_use_template.replace("DEFAULT_SYSTEM_MESSAGE", default_tool_message)

        rag_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
            "{{ '<results>' }}"
            "{% for document in documents %}"  # Loop over all non-system messages
            "{{ '\nDocument: ' }}"
            "{{ loop.index0 }}\n"
            "{% for key, value in document.items() %}"
            "{{ key }}: {{value}}\n"
            "{% endfor %}"
            "{% endfor %}"
            "{{ '</results>'}}"
            "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
            "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
            "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
            "{% if citation_mode=='accurate' %}"
            "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
            "{% endif %}"
            "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
            "{{ '<|END_OF_TURN_TOKEN|>' }}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message)

        return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template}

    def apply_tool_use_template(
        self,
        conversation: Union[List[Dict[str, str]], "Conversation"],
        tools: List[Dict],
        **kwargs,
    ) -> Union[str, List[int]]:
        """Create a Command-R tool-use prompt.

        Once rendered, the prompt instructs the model to generate a list of actions to perform on a set of user supplied tools
        to help carry out the user's requests.

        Conceptually, this works in the same way as `apply_chat_format`, but takes an additional `tools` parameter.

        Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of available
        tools for the model to use into a prompt string, or a list of token ids.
        This method will use the tokenizer's `default_tool_use_template` template specified at the class level.
        You can override the default template using the `tool_use_template` kwarg but the quality of your results may decrease.

        Args:
            conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
                with "role" and "content" keys, representing the chat history so far.
            tools (List[Dict]): a list of tools to render into the prompt for the model to choose from.
                See an example at the bottom of the docstring. The format should be:

                - name (str): The name of the tool to be called. Valid names contain only the characters a-z,
                A-Z, 0-9, _ and must not begin with a digit.
                - description (str): The description of what the tool does, the model uses the description to
                choose when and how to call the function.
                - parameter_definitions (List[Dict]): The input parameters of the tool. Accepts a dictionary
                where the key is the name of the parameter and the value is the parameter spec.
                Valid parameter names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit.
                Parameter specs are as follows:

                    - description (str): The description of the parameter.
                    - type (str): the type of the parameter - most effective for python builtin data types, such as 'str', 'bool'
                    - required: boolean: Denotes whether the parameter is always present (required) or not. Defaults to not required.
            add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
                the start of an assistant message. This is useful when you want to generate a response from the model.
                Note that this argument will be passed to the chat template, and so it must be supported in the
                template for this argument to have any effect.
            tokenize (`bool`, defaults to `True`):
                Whether to tokenize the output. If `False`, the output will be a string.
            padding (`bool`, defaults to `False`):
                Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
            truncation (`bool`, defaults to `False`):
                Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
            max_length (`int`, *optional*):
                Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
                not specified, the tokenizer's `max_length` attribute will be used as a default.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
                values are:

                - `'tf'`: Return TensorFlow `tf.Tensor` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
            **tokenizer_kwargs:
                Additional kwargs to pass to the tokenizer.

        Returns:
            Conditional return:

                - `str`: A rendered prompt string.
                - if tokenize=True:
                `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
                output is ready to pass to the model, either directly or via methods like `generate()`.

        Example:
            ```python
            >>> tokenizer = CohereTokenizerFast.from_pretrained("CohereForAI/c4ai-command-r-v01")
            >>> tools = [
            ...     {
            ...         "name": "internet_search",
            ...         "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
            ...         "parameter_definitions": {
            ...             "query": {
            ...                 "description": "Query to search the internet with",
            ...                 "type": "str",
            ...                 "required": True
            ...             }
            ...         }
            ...     },
            ...     {
            ...         "name': "directly_answer",
            ...         "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
            ...         "parameter_definitions": {}
            ...     }
            ... ]
            >>> conversation = [
            ...     {"role": "user", "content": "Whats the biggest penguin in the world?"}
            ... ]
            >>> # render the prompt, ready for user to inspect, or for input into the model:
            >>> prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=False, add_generation_prompt=True)
            >>> print(prompt)
            <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
            The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

            # System Preamble
            ## Basic Rules
            You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

            # User Preamble
            ## Task and Context
            You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

            ## Style Guide
            Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.

            ## Available Tools
            Here is a list of tools that you have available to you:

            \\`\\`\\`python
            def internet_search(query: str) -> List[Dict]:
                \"\"\"Returns a list of relevant document snippets for a textual query retrieved from the internet

                Args:
                    query (str): Query to search the internet with
                \"\"\"
                pass
            \\`\\`\\`

            \\`\\`\\`python
            def directly_answer() -> List[Dict]:
                \"\"\"Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
                \"\"\"
                pass
            \\`\\`\\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
            \\`\\`\\`json
            [
                {
                    "tool_name": title of the tool in the specification,
                    "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
                }
            ]\\`\\`\\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
            ```

            ```python
            >>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
            >>> outputs = model.generate(inputs, max_new_tokens=128)
            >>> print(tokenizer.decode(outputs[0]))
            ```

            Action: ```json
            [
                {
                    "tool_name": "internet_search",
                    "parameters": {
                        "query": "biggest penguin in the world"
                    }
                }
            ]
            ```
        """
        return self.apply_chat_template(
            conversation,
            chat_template="tool_use",
            tools=tools,
            **kwargs,
        )

    def apply_grounded_generation_template(
        self,
        conversation: Union[List[Dict[str, str]], "Conversation"],
        documents: List[Dict],
        citation_mode: Literal["fast", "accurate"] = "accurate",
        **kwargs,
    ) -> Union[str, List[int]]:
        """Create a Command-R grounded generation (aka RAG) prompt.

        Once rendered, the prompt instructs the model to generate a response with citations in, based on supplied documents.

        Conceptually, this works in the same way as `apply_chat_format`, but takes additional `documents`
        and parameter `citation_mode` parameters.

        Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of
        documents for the model to ground its response on into a prompt string, or a list of token ids.
        This method will use the tokenizer's `grounded_generation_template` template specified at the class level.
        You can override the default template using the `grounded_generation_template` kwarg but the quality of your results may decrease.

        Args:
            conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
                with "role" and "content" keys, representing the chat history so far.
            documents (List[Dict[str, str]): A list of dicts, representing documents or tool outputs to ground your
                generation on. A document is a semistructured dict, wiht a string to string mapping. Common fields are
                `url`, `title`, `snippet` etc but should be descriptive of the key. They will get rendered into the prompt.
            citation_mode: either "accurate" (prompt the model to generate an answer first, then rewrite it with citation
                spans in) or "fast", where the prompt instructs the model to generate an answer with citations in directly.
                The former has higher quality citations, the latter requires fewer tokens to be generated.
            add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
                the start of an assistant message. This is useful when you want to generate a response from the model.
                Note that this argument will be passed to the chat template, and so it must be supported in the
                template for this argument to have any effect.
            tokenize (`bool`, defaults to `True`):
                Whether to tokenize the output. If `False`, the output will be a string.
            padding (`bool`, defaults to `False`):
                Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
            truncation (`bool`, defaults to `False`):
                Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
            max_length (`int`, *optional*):
                Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
                not specified, the tokenizer's `max_length` attribute will be used as a default.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
                values are:

                - `'tf'`: Return TensorFlow `tf.Tensor` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
            **tokenizer_kwargs: Additional kwargs to pass to the tokenizer.

        Returns:
            Conditional return:

                - `str`: A rendered prompt string.
                - or if tokenize=True:
                - `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
                    output is ready to pass to the model, either directly or via methods like `generate()`.

        Example:
            ```python
            >>> tokenizer = CohereTokenizerFast.from_pretrained('CohereForAI/c4ai-command-r-v01')
            ...
            >>> # define documents:
            >>> documents = [
                { "title": "Tall penguins", "text": "Emperor penguins are the tallest." },
                { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
            ]
            >>> # define a conversation:
            >>> conversation = [
                {"role": "user", "content": "Whats the biggest penguin in the world?"}
            ]
            >>> # render the prompt, ready for user to inspect, or for input into the model:
            >>> grounded_generation_prompt = tokenizer.apply_grounded_generation_template(conversation, documents=documents, tokenize=False, add_generation_prompt=True)
            >>> print(grounded_generation_prompt)
            <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
            The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

            ## Basic Rules
            You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

            # User Preamble
            ## Task and Context
            You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

            ## Style Guide
            Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
            Document: 0
            title: Tall penguins
            text: Emperor penguins are the tallest.

            Document: 1
            title: Penguin habitats
            text: Emperor penguins only live in Antarctica.
            </results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
            Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
            Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
            Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
            Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'''
            >>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
            >>> outputs = model.generate(inputs, max_new_tokens=128)
            >>> print(tokenizer.decode(outputs[0]))
            Relevant Documents: 0,1
            Cited Documents: 0,1
            Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
            Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
            ```
        """
        return self.apply_chat_template(
            conversation,
            chat_template="rag",
            documents=documents,
            citation_mode=citation_mode,
            **kwargs,
        )

    # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output

mindnlp.transformers.models.cohere.tokenization_cohere_fast.CohereTokenizerFast.default_chat_template property

Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat. Additioanlly, to indicate the source of the message, <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|> for user, assitant and system messages respectively.

The output should look something like:

<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|> <|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|> {{ I am doing well! }}<|END_OF_TURN_TOKEN|>

Use add_generation_prompt to add a prompt for the model to generate a response:

Example:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
>>> messages = [{"role": "user", "content": "Hello, how are you?"}]
>>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
'<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'

mindnlp.transformers.models.cohere.tokenization_cohere_fast.CohereTokenizerFast.apply_grounded_generation_template(conversation, documents, citation_mode='accurate', **kwargs)

Create a Command-R grounded generation (aka RAG) prompt.

Once rendered, the prompt instructs the model to generate a response with citations in, based on supplied documents.

Conceptually, this works in the same way as apply_chat_format, but takes additional documents and parameter citation_mode parameters.

Converts a Conversation object or a list of dictionaries with "role" and "content" keys and a list of documents for the model to ground its response on into a prompt string, or a list of token ids. This method will use the tokenizer's grounded_generation_template template specified at the class level. You can override the default template using the grounded_generation_template kwarg but the quality of your results may decrease.

PARAMETER DESCRIPTION
conversation

A Conversation object or list of dicts with "role" and "content" keys, representing the chat history so far.

TYPE: Union[List[Dict[str, str]], Conversation]

documents

A list of dicts, representing documents or tool outputs to ground your generation on. A document is a semistructured dict, wiht a string to string mapping. Common fields are url, title, snippet etc but should be descriptive of the key. They will get rendered into the prompt.

TYPE: List[Dict[str, str]

citation_mode

either "accurate" (prompt the model to generate an answer first, then rewrite it with citation spans in) or "fast", where the prompt instructs the model to generate an answer with citations in directly. The former has higher quality citations, the latter requires fewer tokens to be generated.

TYPE: Literal['fast', 'accurate'] DEFAULT: 'accurate'

add_generation_prompt

Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect.

TYPE: bool, *optional*

tokenize

Whether to tokenize the output. If False, the output will be a string.

TYPE: `bool`, defaults to `True`

padding

Whether to pad sequences to the maximum length. Has no effect if tokenize is False.

TYPE: `bool`, defaults to `False`

truncation

Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.

TYPE: `bool`, defaults to `False`

max_length

Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is False. If not specified, the tokenizer's max_length attribute will be used as a default.

TYPE: `int`, *optional*

return_tensors

If set, will return tensors of a particular framework. Has no effect if tokenize is False. Acceptable values are:

  • 'tf': Return TensorFlow tf.Tensor objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional*

return_dict

Whether to return a dictionary with named outputs. Has no effect if tokenize is False.

TYPE: `bool`, *optional*, defaults to `False`

**tokenizer_kwargs

Additional kwargs to pass to the tokenizer.

RETURNS DESCRIPTION
Union[str, List[int]]

Conditional return:

  • str: A rendered prompt string.
  • or if tokenize=True:
  • List[int]: A list of token ids representing the tokenized chat so far, including control tokens. This output is ready to pass to the model, either directly or via methods like generate().
Example
>>> tokenizer = CohereTokenizerFast.from_pretrained('CohereForAI/c4ai-command-r-v01')
...
>>> # define documents:
>>> documents = [
    { "title": "Tall penguins", "text": "Emperor penguins are the tallest." },
    { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
]
>>> # define a conversation:
>>> conversation = [
    {"role": "user", "content": "Whats the biggest penguin in the world?"}
]
>>> # render the prompt, ready for user to inspect, or for input into the model:
>>> grounded_generation_prompt = tokenizer.apply_grounded_generation_template(conversation, documents=documents, tokenize=False, add_generation_prompt=True)
>>> print(grounded_generation_prompt)
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

## Basic Rules
You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

# User Preamble
## Task and Context
You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

## Style Guide
Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
Document: 0
title: Tall penguins
text: Emperor penguins are the tallest.

Document: 1
title: Penguin habitats
text: Emperor penguins only live in Antarctica.
</results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'''
>>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
>>> outputs = model.generate(inputs, max_new_tokens=128)
>>> print(tokenizer.decode(outputs[0]))
Relevant Documents: 0,1
Cited Documents: 0,1
Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
Source code in mindnlp/transformers/models/cohere/tokenization_cohere_fast.py
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
def apply_grounded_generation_template(
    self,
    conversation: Union[List[Dict[str, str]], "Conversation"],
    documents: List[Dict],
    citation_mode: Literal["fast", "accurate"] = "accurate",
    **kwargs,
) -> Union[str, List[int]]:
    """Create a Command-R grounded generation (aka RAG) prompt.

    Once rendered, the prompt instructs the model to generate a response with citations in, based on supplied documents.

    Conceptually, this works in the same way as `apply_chat_format`, but takes additional `documents`
    and parameter `citation_mode` parameters.

    Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of
    documents for the model to ground its response on into a prompt string, or a list of token ids.
    This method will use the tokenizer's `grounded_generation_template` template specified at the class level.
    You can override the default template using the `grounded_generation_template` kwarg but the quality of your results may decrease.

    Args:
        conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
            with "role" and "content" keys, representing the chat history so far.
        documents (List[Dict[str, str]): A list of dicts, representing documents or tool outputs to ground your
            generation on. A document is a semistructured dict, wiht a string to string mapping. Common fields are
            `url`, `title`, `snippet` etc but should be descriptive of the key. They will get rendered into the prompt.
        citation_mode: either "accurate" (prompt the model to generate an answer first, then rewrite it with citation
            spans in) or "fast", where the prompt instructs the model to generate an answer with citations in directly.
            The former has higher quality citations, the latter requires fewer tokens to be generated.
        add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
            the start of an assistant message. This is useful when you want to generate a response from the model.
            Note that this argument will be passed to the chat template, and so it must be supported in the
            template for this argument to have any effect.
        tokenize (`bool`, defaults to `True`):
            Whether to tokenize the output. If `False`, the output will be a string.
        padding (`bool`, defaults to `False`):
            Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
        truncation (`bool`, defaults to `False`):
            Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
        max_length (`int`, *optional*):
            Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
            not specified, the tokenizer's `max_length` attribute will be used as a default.
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
            values are:

            - `'tf'`: Return TensorFlow `tf.Tensor` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
        **tokenizer_kwargs: Additional kwargs to pass to the tokenizer.

    Returns:
        Conditional return:

            - `str`: A rendered prompt string.
            - or if tokenize=True:
            - `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
                output is ready to pass to the model, either directly or via methods like `generate()`.

    Example:
        ```python
        >>> tokenizer = CohereTokenizerFast.from_pretrained('CohereForAI/c4ai-command-r-v01')
        ...
        >>> # define documents:
        >>> documents = [
            { "title": "Tall penguins", "text": "Emperor penguins are the tallest." },
            { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
        ]
        >>> # define a conversation:
        >>> conversation = [
            {"role": "user", "content": "Whats the biggest penguin in the world?"}
        ]
        >>> # render the prompt, ready for user to inspect, or for input into the model:
        >>> grounded_generation_prompt = tokenizer.apply_grounded_generation_template(conversation, documents=documents, tokenize=False, add_generation_prompt=True)
        >>> print(grounded_generation_prompt)
        <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
        The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

        ## Basic Rules
        You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

        # User Preamble
        ## Task and Context
        You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

        ## Style Guide
        Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
        Document: 0
        title: Tall penguins
        text: Emperor penguins are the tallest.

        Document: 1
        title: Penguin habitats
        text: Emperor penguins only live in Antarctica.
        </results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
        Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
        Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
        Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
        Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'''
        >>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
        >>> outputs = model.generate(inputs, max_new_tokens=128)
        >>> print(tokenizer.decode(outputs[0]))
        Relevant Documents: 0,1
        Cited Documents: 0,1
        Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
        Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
        ```
    """
    return self.apply_chat_template(
        conversation,
        chat_template="rag",
        documents=documents,
        citation_mode=citation_mode,
        **kwargs,
    )

mindnlp.transformers.models.cohere.tokenization_cohere_fast.CohereTokenizerFast.apply_tool_use_template(conversation, tools, **kwargs)

Create a Command-R tool-use prompt.

Once rendered, the prompt instructs the model to generate a list of actions to perform on a set of user supplied tools to help carry out the user's requests.

Conceptually, this works in the same way as apply_chat_format, but takes an additional tools parameter.

Converts a Conversation object or a list of dictionaries with "role" and "content" keys and a list of available tools for the model to use into a prompt string, or a list of token ids. This method will use the tokenizer's default_tool_use_template template specified at the class level. You can override the default template using the tool_use_template kwarg but the quality of your results may decrease.

PARAMETER DESCRIPTION
conversation

A Conversation object or list of dicts with "role" and "content" keys, representing the chat history so far.

TYPE: Union[List[Dict[str, str]], Conversation]

tools

a list of tools to render into the prompt for the model to choose from. See an example at the bottom of the docstring. The format should be:

  • name (str): The name of the tool to be called. Valid names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit.
  • description (str): The description of what the tool does, the model uses the description to choose when and how to call the function.
  • parameter_definitions (List[Dict]): The input parameters of the tool. Accepts a dictionary where the key is the name of the parameter and the value is the parameter spec. Valid parameter names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit. Parameter specs are as follows:

    • description (str): The description of the parameter.
    • type (str): the type of the parameter - most effective for python builtin data types, such as 'str', 'bool'
    • required: boolean: Denotes whether the parameter is always present (required) or not. Defaults to not required.

TYPE: List[Dict]

add_generation_prompt

Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect.

TYPE: bool, *optional*

tokenize

Whether to tokenize the output. If False, the output will be a string.

TYPE: `bool`, defaults to `True`

padding

Whether to pad sequences to the maximum length. Has no effect if tokenize is False.

TYPE: `bool`, defaults to `False`

truncation

Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.

TYPE: `bool`, defaults to `False`

max_length

Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is False. If not specified, the tokenizer's max_length attribute will be used as a default.

TYPE: `int`, *optional*

return_tensors

If set, will return tensors of a particular framework. Has no effect if tokenize is False. Acceptable values are:

  • 'tf': Return TensorFlow tf.Tensor objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional*

return_dict

Whether to return a dictionary with named outputs. Has no effect if tokenize is False.

TYPE: `bool`, *optional*, defaults to `False`

**tokenizer_kwargs

Additional kwargs to pass to the tokenizer.

RETURNS DESCRIPTION
Union[str, List[int]]

Conditional return:

  • str: A rendered prompt string.
  • if tokenize=True: List[int]: A list of token ids representing the tokenized chat so far, including control tokens. This output is ready to pass to the model, either directly or via methods like generate().
Example
>>> tokenizer = CohereTokenizerFast.from_pretrained("CohereForAI/c4ai-command-r-v01")
>>> tools = [
...     {
...         "name": "internet_search",
...         "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
...         "parameter_definitions": {
...             "query": {
...                 "description": "Query to search the internet with",
...                 "type": "str",
...                 "required": True
...             }
...         }
...     },
...     {
...         "name': "directly_answer",
...         "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
...         "parameter_definitions": {}
...     }
... ]
>>> conversation = [
...     {"role": "user", "content": "Whats the biggest penguin in the world?"}
... ]
>>> # render the prompt, ready for user to inspect, or for input into the model:
>>> prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=False, add_generation_prompt=True)
>>> print(prompt)
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

# System Preamble
## Basic Rules
You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

# User Preamble
## Task and Context
You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

## Style Guide
Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.

## Available Tools
Here is a list of tools that you have available to you:

\`\`\`python
def internet_search(query: str) -> List[Dict]:
    """Returns a list of relevant document snippets for a textual query retrieved from the internet

    Args:
        query (str): Query to search the internet with
    """
    pass
\`\`\`

\`\`\`python
def directly_answer() -> List[Dict]:
    """Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
    """
    pass
\`\`\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
\`\`\`json
[
    {
        "tool_name": title of the tool in the specification,
        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
    }
]\`\`\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
>>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
>>> outputs = model.generate(inputs, max_new_tokens=128)
>>> print(tokenizer.decode(outputs[0]))

Action: json [ { "tool_name": "internet_search", "parameters": { "query": "biggest penguin in the world" } } ]

Source code in mindnlp/transformers/models/cohere/tokenization_cohere_fast.py
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
def apply_tool_use_template(
    self,
    conversation: Union[List[Dict[str, str]], "Conversation"],
    tools: List[Dict],
    **kwargs,
) -> Union[str, List[int]]:
    """Create a Command-R tool-use prompt.

    Once rendered, the prompt instructs the model to generate a list of actions to perform on a set of user supplied tools
    to help carry out the user's requests.

    Conceptually, this works in the same way as `apply_chat_format`, but takes an additional `tools` parameter.

    Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of available
    tools for the model to use into a prompt string, or a list of token ids.
    This method will use the tokenizer's `default_tool_use_template` template specified at the class level.
    You can override the default template using the `tool_use_template` kwarg but the quality of your results may decrease.

    Args:
        conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
            with "role" and "content" keys, representing the chat history so far.
        tools (List[Dict]): a list of tools to render into the prompt for the model to choose from.
            See an example at the bottom of the docstring. The format should be:

            - name (str): The name of the tool to be called. Valid names contain only the characters a-z,
            A-Z, 0-9, _ and must not begin with a digit.
            - description (str): The description of what the tool does, the model uses the description to
            choose when and how to call the function.
            - parameter_definitions (List[Dict]): The input parameters of the tool. Accepts a dictionary
            where the key is the name of the parameter and the value is the parameter spec.
            Valid parameter names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit.
            Parameter specs are as follows:

                - description (str): The description of the parameter.
                - type (str): the type of the parameter - most effective for python builtin data types, such as 'str', 'bool'
                - required: boolean: Denotes whether the parameter is always present (required) or not. Defaults to not required.
        add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
            the start of an assistant message. This is useful when you want to generate a response from the model.
            Note that this argument will be passed to the chat template, and so it must be supported in the
            template for this argument to have any effect.
        tokenize (`bool`, defaults to `True`):
            Whether to tokenize the output. If `False`, the output will be a string.
        padding (`bool`, defaults to `False`):
            Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
        truncation (`bool`, defaults to `False`):
            Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
        max_length (`int`, *optional*):
            Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
            not specified, the tokenizer's `max_length` attribute will be used as a default.
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
            values are:

            - `'tf'`: Return TensorFlow `tf.Tensor` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
        **tokenizer_kwargs:
            Additional kwargs to pass to the tokenizer.

    Returns:
        Conditional return:

            - `str`: A rendered prompt string.
            - if tokenize=True:
            `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
            output is ready to pass to the model, either directly or via methods like `generate()`.

    Example:
        ```python
        >>> tokenizer = CohereTokenizerFast.from_pretrained("CohereForAI/c4ai-command-r-v01")
        >>> tools = [
        ...     {
        ...         "name": "internet_search",
        ...         "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
        ...         "parameter_definitions": {
        ...             "query": {
        ...                 "description": "Query to search the internet with",
        ...                 "type": "str",
        ...                 "required": True
        ...             }
        ...         }
        ...     },
        ...     {
        ...         "name': "directly_answer",
        ...         "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
        ...         "parameter_definitions": {}
        ...     }
        ... ]
        >>> conversation = [
        ...     {"role": "user", "content": "Whats the biggest penguin in the world?"}
        ... ]
        >>> # render the prompt, ready for user to inspect, or for input into the model:
        >>> prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=False, add_generation_prompt=True)
        >>> print(prompt)
        <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
        The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

        # System Preamble
        ## Basic Rules
        You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

        # User Preamble
        ## Task and Context
        You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

        ## Style Guide
        Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.

        ## Available Tools
        Here is a list of tools that you have available to you:

        \\`\\`\\`python
        def internet_search(query: str) -> List[Dict]:
            \"\"\"Returns a list of relevant document snippets for a textual query retrieved from the internet

            Args:
                query (str): Query to search the internet with
            \"\"\"
            pass
        \\`\\`\\`

        \\`\\`\\`python
        def directly_answer() -> List[Dict]:
            \"\"\"Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
            \"\"\"
            pass
        \\`\\`\\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
        \\`\\`\\`json
        [
            {
                "tool_name": title of the tool in the specification,
                "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
            }
        ]\\`\\`\\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
        ```

        ```python
        >>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
        >>> outputs = model.generate(inputs, max_new_tokens=128)
        >>> print(tokenizer.decode(outputs[0]))
        ```

        Action: ```json
        [
            {
                "tool_name": "internet_search",
                "parameters": {
                    "query": "biggest penguin in the world"
                }
            }
        ]
        ```
    """
    return self.apply_chat_template(
        conversation,
        chat_template="tool_use",
        tools=tools,
        **kwargs,
    )

mindnlp.transformers.models.cohere.tokenization_cohere_fast.CohereTokenizerFast.update_post_processor()

Updates the underlying post processor with the current bos_token and eos_token.

Source code in mindnlp/transformers/models/cohere/tokenization_cohere_fast.py
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
def update_post_processor(self):
    """
    Updates the underlying post processor with the current `bos_token` and `eos_token`.
    """
    bos = self.bos_token
    bos_token_id = self.bos_token_id
    if bos is None and self.add_bos_token:
        raise ValueError("add_bos_token = True but bos_token = None")

    eos = self.eos_token
    eos_token_id = self.eos_token_id
    if eos is None and self.add_eos_token:
        raise ValueError("add_eos_token = True but eos_token = None")

    single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
    pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"

    special_tokens = []
    if self.add_bos_token:
        special_tokens.append((bos, bos_token_id))
    if self.add_eos_token:
        special_tokens.append((eos, eos_token_id))
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=single, pair=pair, special_tokens=special_tokens
    )