๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

Projects/Hate Speech Detection

[๋ถ„์„ ๋ฐฉ๋ฒ•] HAN(Hierarchical Attention Network) ์ด๋ž€?

While deciding which deep learning algorithm to use for hate speech detection, I came across the HAN algorithm.

HAN (Hierarchical Attention Network) is a deep learning algorithm specialized for document classification.

HAN์˜ ํŠน์ง•

์œ„ ๊ทธ๋ฆผ์€ HAN ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

1) ๋ฌธ์„œ์˜ ๊ณ„์ธต์  ๊ตฌ์กฐ๋ฅผ ๋ฐ˜์˜

๋ฌธ์„œ๋Š” ๋ฌธ์žฅ๋“ค๋กœ, ๋ฌธ์žฅ์€ ๋‹จ์–ด๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š”๋ฐ, ์ด๋Ÿฌํ•œ ๊ณ„์ธต์ ๊ตฌ์กฐ๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š”๋ฐ ์ ํ•ฉํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

2) Attention mechanism

์ค‘์š”ํ•œ ๋‹จ์–ด์™€ ๋ฌธ์žฅ์— ๊ฐ€์ค‘์น˜๋ฅผ ๋”ํ•ด์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋‘ ๊ฐ€์ง€ ํŠน์ง•์œผ๋กœ, document classification์˜ ์„ฑ๋Šฅ์„ ๋†’์—ฌ์ค๋‹ˆ๋‹ค.

 

 

 

 

๊ตฌ์กฐ๋ฅผ  ์ข€ ๋” ํŒŒํ—ค์ณ ๋ด…์‹œ๋‹ค.

 

L : the number of sentences in the document

s_i : each individual sentence of the document

T_i : the number of words in sentence i

w_it with t ∈ [1, T_i] : the words of the i-th sentence

Word Encoder

1) ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ

The words are embedded into vectors through an embedding matrix.

2) Bidirectional GRU: used to get annotations of words by summarizing information from both directions, thereby incorporating contextual information into each annotation.

forward : reads the words of sentence s_i in order, from w_i1 to w_iT_i

backward : reads the words in reverse order, from w_iT_i to w_i1

์œ„ ์ž‘์—…(forward hidden state์™€ backward hidden state๋ฅผ ๋ณ‘ํ•ฉํ•˜๋Š” ๊ณผ์ •) ์„ ํ†ตํ•ด์„œ ์ฃผ์–ด์ง„ ๋‹จ์–ด wit์— ๋Œ€ํ•ด annotation ์„ ์–ป๋Š”๋‹ค.

wit์„ ์ค‘์‹ฌ์œผ๋กœ ํ•œ ์ „์ฒด ๋ฌธ์žฅ์˜ ์ •๋ณด๋ฅผ ์š”์•ฝํ•ฉ๋‹ˆ๋‹ค.

Word Attention

  • ๋ชจ๋“  ๋‹จ์–ด๊ฐ€ ๋ฌธ์žฅ์—์„œ ๋™๋“ฑํ•˜๊ฒŒ ๊ธฐ์—ฌํ•˜์ง„ ์•Š๋Š”๋‹ค
  • ๋”ฐ๋ผ์„œ ๋ฌธ์žฅ์˜ ์˜๋ฏธ์— ์ค‘์š”ํ•œ ๋‹จ์–ด๋ฅผ ์ถ”์ถœํ•˜๊ณ  ๊ทธ ์ •๋ณด ๋‹จ์–ด์˜ representation์„ ์ง‘๊ณ„ํ•˜์—ฌ ๋ฌธ์žฅ ๋ฒกํ„ฐ๋ฅผ ํ˜•์„ฑํ•˜๋Š” ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.

(5) h_it (the word annotation) is fed through a one-layer MLP to obtain u_it (a hidden representation of h_it).

(6) ๋‹จ์–ด์˜ ์ค‘์š”๋„๋ฅผ ์ธก์ • : uw(word level context vector) ์™€ uit์˜ ์œ ์‚ฌ์„ฑ์œผ๋กœ ์ธก์ •

→ softmax function์„ ํ†ตํ•ด ait (normalized ๋œ importance weight) ๋ฅผ ์–ป์Œ

(7) s_i (the sentence vector) is computed as a weighted sum of the word annotations, using these importance weights.
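These three steps are equations (5)–(7) in Yang et al. (2016):

$$u_{it} = \tanh(W_w h_{it} + b_w) \tag{5}$$

$$a_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)} \tag{6}$$

$$s_i = \sum_t a_{it} h_{it} \tag{7}$$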

  • uit์˜ ์—ญํ• ์€?
    hit ๋Š” forward, backward ๊ฒฐํ•ฉํ•˜์—ฌ ์ƒ๊ธด(=GRU) hidden state
    → ์–˜๋ฅผ one-layer MLP์— ๋„ฃ์–ด์ฃผ๋ฉด uit๊ฐ€ ์ƒ๊น€
    → uit๋Š” hit์˜ hidden representation ์ธ ๊ฒƒ

Sentence Encoder

  • The sentence vectors s_i are now given as input
  • A document vector can be obtained in a similar way ⇒ bidirectional GRU

  • i ๋ฒˆ์งธ sentence ์˜ annotation : forward hi์™€ backward hj๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ์–ป์„ ์ˆ˜ ์žˆ์Œ

  • hi๋Š” ๋ฌธ์žฅ i ์ฃผ์œ„์˜ ์ด์›ƒ ๋ฌธ์žฅ์„ ์š”์•ฝํ•˜์ง€๋งŒ ์—ฌ์ „ํžˆ ๋ฌธ์žฅ i์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค.

Sentence Attention

  • ๋ฌธ์„œ๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ฌธ์žฅ์— ๋Œ€ํ•œ ๋ณด์ƒ : attention mechanism

(9) ๋ฌธ์žฅ์˜ ์ค‘์š”๋„๋ฅผ ์ธก์ • : us(setence level context vector) ์™€ ui์˜ ์œ ์‚ฌ์„ฑ์œผ๋กœ ์ธก์ •

→ softmax function์„ ํ†ตํ•ด ai (normalized ๋œ importance weight) ๋ฅผ ์–ป์Œ

(10) v (the document vector) is computed as the weighted sum of the sentence annotations; it summarizes all the information of the sentences in the document.
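These steps are equations (8)–(10) in Yang et al. (2016), where equation (8) defines u_i from h_i:

$$u_i = \tanh(W_s h_i + b_s) \tag{8}$$

$$a_i = \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)} \tag{9}$$

$$v = \sum_i a_i h_i \tag{10}$$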

  • ๋ฌธ์žฅ ์ˆ˜์ค€์˜ ์ปจํ…์ŠคํŠธ ๋ฒกํ„ฐ๋Š” ํ›ˆ๋ จ ๊ณผ์ •์—์„œ ๋ฌด์ž‘์œ„๋กœ ์ดˆ๊ธฐํ™”(randomly initialized)๋˜๊ณ  ๊ณต๋™ ํ•™์Šต ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    → (๋งค์šฐ ์ค‘์š”) ์œ„์—์„œ ๋‚˜์™”๋˜ uw(word level context vector), us(setence level context vector) ๋Š” randomly initialized ๋œ ๋ฒกํ„ฐ
    → ์‚ฌ์ „์— ์ƒ์„ฑ๋œ vector ๋“ค๊ณผ๋Š” ๋ฌด๊ด€ํ•œ ๋žœ๋ค๋ฒกํ„ฐ

Document Classification

  • v (the document vector) is a high-level representation of the document

(11) It can therefore be used as features for document classification.

(12) The negative log likelihood of the correct label is used as the training loss.

  • j (๋ฌธ์„œ d์˜ label)

Reference

Yang, Zichao, et al. "Hierarchical Attention Networks for Document Classification." Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.