
Encodings

Text Input

$$\text{text} = \text{“Tokens in NLP.”}$$

Tokenised Letters

$$[\text{T}, \text{o},\text{k},\text{e},\text{n},\text{s},\text{ },\text{i},\text{n}, \text{ }, \text{N},\text{L}, \text{P}, \text{.}]$$
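
The split above can be reproduced with a minimal Python sketch. It assumes character-level tokenisation, where every character (spaces and punctuation included) becomes its own token; the variable names `text` and `tokens` are illustrative only.

```python
text = "Tokens in NLP."

# Character-level tokenisation: each character, including the
# spaces and the full stop, becomes one token.
tokens = list(text)

print(tokens)
```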

Index Encoding Scheme

$$[\text{' '}: 0, \text{.}: 1, \text{L}: 2, \text{N}: 3, \text{P}: 4, \text{T}: 5, \text{e}: 6, \text{i}: 7, \text{k}: 8, \text{n}: 9, \text{o}: 10, \text{s}: 11]$$
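
One way to arrive at this scheme, sketched here as an assumption rather than the only option, is to sort the distinct characters by Unicode code point and number them from 0. The space and the full stop sort before the uppercase letters, which sort before the lowercase ones, and that reproduces the ordering shown. The name `vocab` is hypothetical.

```python
text = "Tokens in NLP."

# Collect the distinct characters, sort them by code point,
# and assign consecutive indices starting at 0.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

print(vocab)
```

Any other fixed ordering would work equally well, as long as the same mapping is used for encoding and decoding.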

Encoded Text

$$[5, 10, 8, 6, 9, 11, 0, 7, 9, 0, 3, 2, 4, 1]$$
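
Encoding is then a per-character lookup in the scheme. The sketch below hard-codes the mapping from the Index Encoding Scheme section and also inverts it to show that the original text can be recovered; `vocab`, `encoded`, `inverse`, and `decoded` are illustrative names.

```python
# Index encoding scheme, copied from the mapping above.
vocab = {" ": 0, ".": 1, "L": 2, "N": 3, "P": 4, "T": 5,
         "e": 6, "i": 7, "k": 8, "n": 9, "o": 10, "s": 11}

text = "Tokens in NLP."

# Replace every character with its index.
encoded = [vocab[ch] for ch in text]

# Invert the mapping to decode the indices back into text.
inverse = {idx: ch for ch, idx in vocab.items()}
decoded = "".join(inverse[idx] for idx in encoded)

print(encoded)
print(decoded)
```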

One Hot Encoding

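| char | token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|------|-------|---|---|---|---|---|---|---|---|---|---|----|----|
| T    | 5     | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0  | 0  |
| o    | 10    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1  | 0  |
| k    | 8     | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0  | 0  |
| e    | 6     | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0  | 0  |
| n    | 9     | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0  | 0  |
| s    | 11    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 1  |
| ' '  | 0     | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |
| i    | 7     | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0  | 0  |
| n    | 9     | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0  | 0  |
| ' '  | 0     | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |
| N    | 3     | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |
| L    | 2     | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |
| P    | 4     | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |
| .    | 1     | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |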
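
A one-hot vector for a token is a vector of zeros, as long as the vocabulary, with a single 1 at the token's index. The following is a minimal pure-Python sketch under that definition. The helper `one_hot` and the names `encoded`, `vocab_size`, and `one_hot_rows` are hypothetical, and in practice a library routine would usually be used instead.

```python
# Encoded text from the section above; the vocabulary has 12 entries.
encoded = [5, 10, 8, 6, 9, 11, 0, 7, 9, 0, 3, 2, 4, 1]
vocab_size = 12


def one_hot(index, size):
    """Return a list of `size` zeros with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# One row per token, in the order the tokens appear in the text.
one_hot_rows = [one_hot(idx, vocab_size) for idx in encoded]

for idx, row in zip(encoded, one_hot_rows):
    print(idx, row)
```

Each row sums to 1, and repeated characters such as `n` and the space produce identical rows.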