Creativity, Inc.: For Everyone Who Builds Great Products

Pixar Animation Studios is a company with a remarkable number of box-office hits, as the many familiar characters in the image below suggest. In its early years in particular, nearly every film it released topped the box office. I know little about how animation is actually made, but it is clear that creativity matters. What was the secret that let Pixar carry out such creativity-demanding work so successfully, and so consistently?

(Image: Pixar's familiar characters)

Source: https://www.creativebloq.com/animation/job-at-pixar-10121018

Part of the answer can be found in Creativity, Inc., written by Ed Catmull and shown below.

The book traces how the author, Ed Catmull, grew along with Pixar and, as he led one animated film after another to success, kept asking himself countless questions as a manager and searching for answers. It also includes accounts of the processes Pixar actually ran, so that readers can draw on them in practice.

(Image: cover of the Korean edition of Creativity, Inc.)

Source: https://www.aladin.co.kr/shop/wproduct.aspx?ItemId=45834844

Ed Catmull as a Manager

What impressed me most while reading was how the author keeps weighing the problems he faces as a manager from many angles and working his way toward answers. I have never counted, but there seem to be quite a few passages that run on as one question after another.

As president of Pixar, my goal has always been to build a creative culture that keeps breathing life into the company, so that Pixar can outlive its founders (Steve Jobs, John Lasseter, and me). It is also my goal to share a philosophy of creative culture with managers who are struggling to manage the conflicting yet complementary forces of art and commerce. … I believe that every field needs good leadership that helps people exercise their creativity and produce excellent results.

  • From the Introduction: Lost and Found

This is what management is like. A decision made for sound reasons creates new problems, and to solve those new problems you must make yet more decisions. Problems that arise in a company are never solved simply by correcting the original error. Solving a problem is a process with many steps: you have to identify not only the original problem but also the problems derived from it, and address them together.

  • From Chapter 1: Animation Meets Technology

I have thought about the limits of perception for a long time. Managers are bound to wrestle with them. How far can we see? How much are we perceiving only dimly? Are we ignoring Cassandra's warnings? In other words, are we cursed not to recognize our problems even when we do our best to identify them? Finding answers to these questions is the key to sustaining a creative culture, and exploring those answers is the heart of this book.

  • From Chapter 9: Dealing with Potential Dangers

This passage also shows what kind of person Ed Catmull is.

Watching people at other companies act aggressively and with confidence, unlike me, I worried that I did not look like a president. To put it more precisely, I was afraid of failure.
In a speech I confessed that it was only eight or nine years earlier that I had finally escaped the feeling of being an impostor. … "Does anyone here feel like an impostor?" Every time, every employee raises a hand.
… To figure out whether you are doing well as a manager, you have to erase from your head the mental model of what a manager is supposed to be. The measure of a manager's success is: "Is the team working well together to solve the key problems?"

  • From Chapter 6: Dealing with Fear and Failure

For Everyone Who Builds Products

The reason I recommend this book is that, although it tells its story through Pixar, it is a good guide that can be applied to the many other companies that build products.

First, we live in an age in which 'creativity' is valued more than ever. The term 'knowledge worker' was first used by Peter Drucker in 1968, when he wrote The Age of Discontinuity, and today the term no longer sounds unfamiliar at all. On top of that, as AI advances rapidly, simple repetitive work is being replaced, and the work that AI does not do well is naturally being asked of knowledge workers. What is the most fundamental part of that work? One of its core elements is surely creativity.

Creativity is a crucial ingredient in building a new product, because it is drawn on throughout the whole process, from discovering the problem to deciding how to solve it. Making an animated film at Pixar can also be compared to building a product at a startup. The process of building a new product at a startup is often described like this:

  • Deeply understand the customer, solve the problem at hand, and deliver the result to the customer in a way that moves them.

The process of making an animated film follows a similar frame:

  • Conceive the story you want to tell, breathe life into the setting and each character, and deliver it to the audience in a way that moves them.

Only the form of the product delivered differs; moving the customer (the audience) is the same. With that in mind, the rest of this post briefly covers how Pixar grew.

Pixar's Identity: Goal and Culture (People and Processes)

One of the things that struck me first is how clearly Pixar defined its own character. Two elements matter most in building a company's identity: first, the goal (mission) the company pursues, and second, its culture. Pixar's goal was to make animated films that move people, and it was also an organization that refused to compromise with circumstances and kept pursuing what is essential.

At the time, Pixar was nothing more than a fledgling film studio on the brink of bankruptcy, but its employees shared a conviction: if we make the films we ourselves want to see, audiences will come to see them. To hold on to that conviction, we had to feel like Sisyphus, endlessly pushing a huge boulder toward the mountaintop, recklessly taking on the impossible.
… What Pixar's people took pride in was not numbers like these. Profit is only one of many yardsticks for measuring a company's performance, and not the most meaningful one. The part I am proudest of is the artistic achievement of the films we made. … Our real purpose, then and now, has been to make great films.

  • From the Introduction: Lost and Found

While making 'Toy Story', two creative principles emerged that would profoundly influence Pixar. …
The first principle is "Story is King." …
The second principle is "Trust the Process." We liked this principle because it eased our worries during production.

  • From Chapter 4: Building Pixar's Identity

We met with Disney executives and told them that making a direct-to-video film of lower quality than a theatrical release did not suit Pixar's character; Pixar's culture could not tolerate it. We proposed changing course and making 'Toy Story 2' as a theatrical film. To our surprise, the Disney executives readily agreed.

  • From Chapter 4: Building Pixar's Identity

The most important element of a company's culture is its people. Just as many books bring up finding the right people, talent density, and so on, the author, too, clearly treats people as especially important.

I learned the following lesson: give a good idea to a mediocre team and they will produce a disappointing result; give a mediocre idea to an excellent team and they will fix the idea, or throw it away and come up with something better.
This lesson is worth explaining further. Putting the right team on the job is the precondition for implementing an idea successfully. It is easy to say you want talented people, and managers do want them, but what really matters is how those people interact with one another. Even the smartest people form an inefficient team if they do not fit together. This means a manager is better off focusing on how the team is working than on the talents of the individuals in it. A good team is made up of people who complement one another. … Getting the right people to work on the right problems, alongside people they work well with, matters more than coming up with a good idea.

  • From Chapter 4: Building Pixar's Identity

Only one person pointed out that there was a flaw in the question I had asked: ideas come from people. Without people there are no ideas, and therefore people matter more than ideas.
Why do we fail to see that people matter more than ideas? Because too many of us mistakenly believe that ideas form and exist entirely apart from the people who create them. Ideas are not free-standing things; an idea often takes shape through tens of thousands of decisions made by dozens of people. … A film cannot be made out of a single idea; a film is a collection of many ideas. And in the end it is people who conceive those ideas and turn them into reality. The same is true of every product. …
In other words, focusing on people (their work habits, talents, and values) is the core recipe for success in any creative business. Going through production made me see this more clearly than before. … Since then, Pixar has consistently practiced a people-first management model. The details have been revised over time, but the underlying principle is always the same: nurture and support good people, and they will produce good ideas.

  • From Chapter 4: Building Pixar's Identity

Next, you can see the author continually thinking about how to get outstanding people to collaborate effectively, so that each of them can bring their abilities to bear. That is because he fully believes that the quality of the product depends on the employees.

Management terms such as just-in-time manufacturing and total quality control emerged to describe the quality-control revolution then under way in Japanese manufacturing. The idea these terms describe is simple: the authority to identify and fix problems should be given to every employee, from senior executives down to the most junior person on the production line. W. Edwards Deming believed that any employee, at any level, should be encouraged to stop the assembly line upon spotting a problem in the manufacturing process. …
The approach of Deming and Toyota gave the people closest to the production process the authority and responsibility to improve the quality of the product. In the process, Japanese workers came to feel 'pride' in being members who pointed out problems, proposed changes, and contributed to solving them and growing the company, rather than soulless cogs assembling parts passing by on a conveyor belt (I think this last point is especially important).
… Deming's quality-control theory rests on the democratic principle that every employee should be able to take on a problem without first asking permission. … When management became obsessed with maximizing short-term profit, the employees who had worked with pride at improving quality lost their spirit, and quality defects began to appear.

  • From Chapter 3: The Birth of 'Toy Story' and Redefining Our Goal

Just as the second creative principle, "Trust the Process," came out of this kind of thinking, Pixar made many attempts to establish processes for producing good work. Listed below are the processes that were put in place after Pixar had grown to a considerable size.

  1. Dailies meetings
  2. Research trips
  3. Setting limits
  4. Blending technology and art
  5. Small experiments
  6. Learning to see
  7. Postmortem meetings
  8. Pixar University

Going through these processes one by one, you may well think, 'Aren't we doing these things too?' Look more closely, though, and you can see that, resting on the culture Pixar has built, these processes operate differently than they do at ordinary companies. Just as Netflix's No Rules Rules, which I read recently, pairs freedom with responsibility, at Pixar too I could feel that each individual is given both freedom and responsibility.

'What makes Pixar's Braintrust different from the feedback mechanisms at other companies?' … First, Pixar's Braintrust is made up of people with a deep understanding of storytelling, usually people who have taken part in making films themselves. Pixar's directors welcome criticism from many sources (in fact, every Pixar employee is expected to review works in progress and send in notes), but they take feedback from fellow directors and screenwriters especially seriously. Second, the Braintrust has no authority to issue directives. This is an important difference. A director is not obliged to accept any particular suggestion from the Braintrust; after a Braintrust meeting, deciding what to do with the feedback is up to the director.

  • From Chapter 5: The Value of Candor

Beyond the processes above, the book also covers the intern program and Notes Day, a company-wide workshop. The Notes Day chapter was especially interesting because it does not stop at encouraging employees to participate; it also walks through how the discussion sessions were designed so that they would actually work.

In a meeting about how to cut production costs by 10 percent, Guido Quaroni made a simple suggestion:
"Let's ask every employee to submit cost-cutting ideas."
John Lasseter, visibly excited, said, "That's an interesting idea. How about we stop work for all Pixar employees for one day and talk through how to solve this problem?"

  • From Chapter 13: Notes Day

In Closing

I think this book will be a good guide for the many people who dream of starting a company, and especially for those who want to build a company with a clear identity. I will wrap up this introduction with a line from John Lasseter about company culture and with Ed Catmull's statement of a manager's goal.

John Lasseter often compared Pixar's culture to a living organism. "I think we've discovered a way to grow a new life form that has never before appeared on Earth."

My goal in writing this book has been to let readers know how, and how persistently, the people of Pixar and Disney Animation Studios have searched for ways to realize their vision. The future is not a destination but a direction. Our job is to check every day whether our current course is right, and to correct it when we have strayed. … Uncertainty and change are not variables; they are constants in life that cannot be avoided. And it is because of uncertainty and change that life is interesting.

By nature, a manager's goal cannot be easy to achieve. A manager's goal is to make an excellent product.

Constructing Transformers For Longer Sequences with Sparse Attention Methods

Natural language processing (NLP) models based on Transformers, such as BERT, RoBERTa, T5, or GPT3, are successful for a wide variety of tasks and a mainstay of modern NLP research. The versatility and robustness of Transformers are the primary drivers behind their wide-scale adoption, leading them to be easily adapted for a diverse range of sequence-based tasks — as a seq2seq model for translation, summarization, generation, and others, or as a standalone encoder for sentiment analysis, POS tagging, machine reading comprehension, etc. The key innovation in Transformers is the introduction of a self-attention mechanism, which computes similarity scores for all pairs of positions in an input sequence, and can be evaluated in parallel for each token of the input sequence, avoiding the sequential dependency of recurrent neural networks, and enabling Transformers to vastly outperform previous sequence models like LSTM.

A limitation of existing Transformer models and their derivatives, however, is that the full self-attention mechanism has computational and memory requirements that are quadratic with the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification. Two natural questions arise: 1) Can we achieve the empirical benefits of quadratic full Transformers using sparse models with computational and memory requirements that scale linearly with the input sequence length? 2) Is it possible to show theoretically that these linear Transformers preserve the expressivity and flexibility of the quadratic full Transformers?

We address both of these questions in a recent pair of papers. In “ETC: Encoding Long and Structured Inputs in Transformers”, presented at EMNLP 2020, we present the Extended Transformer Construction (ETC), a novel method for sparse attention that uses structural information to limit the number of computed pairs of similarity scores. This reduces the quadratic dependency on input length to linear and yields strong empirical results in the NLP domain. Then, in “Big Bird: Transformers for Longer Sequences”, presented at NeurIPS 2020, we introduce another sparse attention method, called BigBird, which extends ETC to more generic scenarios where prerequisite domain knowledge about the structure of the source data may be unavailable. We also show theoretically that our proposed sparse attention mechanism preserves the expressivity and flexibility of the quadratic full Transformers. Our proposed methods achieve a new state of the art on challenging long-sequence tasks, including question answering, document summarization, and genome fragment classification.

Attention as a Graph
The attention module used in Transformer models computes similarity scores for all pairs of positions in an input sequence. It is useful to think of the attention mechanism as a directed graph, with tokens represented by nodes and the similarity score computed between a pair of tokens represented by an edge. In this view, the full attention model is a complete graph. The core idea behind our approach is to carefully design sparse graphs, such that one only computes a linear number of similarity scores.

Full attention can be viewed as a complete graph.
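
To ground this picture, here is a minimal dense-attention sketch in plain NumPy (an illustration added for this write-up, not code from either paper); it simply makes the quadratic cost visible, since the score matrix has seq_len x seq_len entries:

import numpy as np

def full_attention(q, k, v):
    # Standard dense attention: every query is scored against every key.
    scores = q @ k.T / np.sqrt(q.shape[-1])          # [seq_len, seq_len] -> O(n^2) entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

seq_len, dim = 512, 64
q = k = v = np.random.randn(seq_len, dim)
print(full_attention(q, k, v).shape)   # (512, 64); the score matrix alone holds 512 * 512 values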

Extended Transformer Construction (ETC)
On NLP tasks that require long and structured inputs, we propose a structured sparse attention mechanism, which we call Extended Transformer Construction (ETC). To achieve structured sparsification of self attention, we developed the global-local attention mechanism. Here the input to the Transformer is split into two parts: a global input where tokens have unrestricted attention, and a long input where tokens can only attend to either the global input or to a local neighborhood. This achieves linear scaling of attention, which allows ETC to significantly scale input length.

In order to further exploit the structure of long documents, ETC combines additional ideas: representing the positional information of the tokens in a relative way, rather than using their absolute position in the sequence; using an additional training objective beyond the usual masked language model (MLM) used in models like BERT; and flexible masking of tokens to control which tokens can attend to which other tokens. For example, given a long selection of text, a global token is applied to each sentence, which connects to all tokens within the sentence, and a global token is also applied to each paragraph, which connects to all tokens within the same paragraph.

An example of the document-structure-based sparse attention of the ETC model. The global variables are denoted by C (blue) for paragraphs and S (yellow) for sentences, while the local variables are denoted by X (grey) for the tokens of the long input.
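
To make the global-local pattern concrete, below is a toy sketch (an illustration for this write-up, not ETC's actual implementation) that builds a boolean attention mask for a short input; the names num_global, num_long, and local_radius are made up for the example:

import numpy as np

def global_local_mask(num_global, num_long, local_radius):
    # Rows/columns 0..num_global-1 are global tokens; the rest are the long input.
    # mask[i, j] == True means token i may attend to token j.
    n = num_global + num_long
    mask = np.zeros((n, n), dtype=bool)
    # Global tokens attend everywhere, and every token attends to the global tokens.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # Long-input tokens additionally attend to a local neighborhood.
    for i in range(num_global, n):
        lo = max(num_global, i - local_radius)
        hi = min(n, i + local_radius + 1)
        mask[i, lo:hi] = True
    return mask

mask = global_local_mask(num_global=4, num_long=16, local_radius=2)
# Each long-input row allows at most num_global + 2 * local_radius + 1 positions,
# so the number of computed similarity scores grows linearly with input length.
print(mask.sum(axis=1))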

With this approach, we report state-of-the-art results in five challenging NLP datasets requiring long or structured inputs: TriviaQA, Natural Questions (NQ), HotpotQA, WikiHop, and OpenKP.

Test set result on Question Answering. For both verified TriviaQA and WikiHop, using ETC achieved a new state of the art.

BigBird
Extending the work of ETC, we propose BigBird — a sparse attention mechanism that is also linear in the number of tokens and is a generic replacement for the attention mechanism used in Transformers. In contrast to ETC, BigBird doesn’t require any prerequisite knowledge about structure present in the source data. Sparse attention in the BigBird model consists of three main parts:

  • A set of global tokens attending to all parts of the input sequence
  • All tokens attending to a set of local neighboring tokens
  • All tokens attending to a set of random tokens
BigBird sparse attention can be seen as adding a few global tokens to a Watts-Strogatz graph.
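
The toy sketch below combines the three components listed above into a single boolean mask (again an illustration only, written at the token level with made-up parameter names; the real implementation is blocked, as discussed in the implementation section):

import numpy as np

def bigbird_mask(seq_len, num_global, window, num_random, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # 1) A set of global tokens attends to, and is attended by, every position.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # 2) Every token attends to a window of local neighbors.
    for i in range(seq_len):
        mask[i, max(0, i - window):min(seq_len, i + window + 1)] = True
    # 3) Every token attends to a few randomly chosen tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

m = bigbird_mask(seq_len=64, num_global=2, window=3, num_random=3)
print(m.sum(), "of", 64 * 64, "pairs scored")   # the count grows linearly in seq_len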

In the BigBird paper, we explain why sparse attention is sufficient to approximate quadratic attention, partially explaining why ETC was successful. A crucial observation is that there is an inherent tension between how few similarity scores one computes and the flow of information between different nodes (i.e., the ability of tokens to influence each other). Global tokens serve as a conduit for information flow and we prove that sparse attention mechanisms with global tokens can be as powerful as the full attention model. In particular, we show that BigBird is as expressive as the original Transformer, is computationally universal (following the work of Yun et al. and Perez et al.), and is a universal approximator of continuous functions. Furthermore, our proof suggests that the use of random graphs can further help ease the flow of information — motivating the use of the random attention component.

This design scales to much longer sequence lengths for both structured and unstructured tasks. Further scaling can be achieved by using gradient checkpointing by trading off training time for sequence length. This lets us extend our efficient sparse transformers to include generative tasks that require an encoder and a decoder, such as long document summarization, on which we achieve a new state of the art.

Summarization ROUGE score for long documents. Both for BigPatent and ArXiv datasets, we achieve a new state of the art result.

Moreover, the fact that BigBird is a generic replacement also allows it to be extended to new domains without pre-existing domain knowledge. In particular, we introduce a novel application of Transformer-based models where long contexts are beneficial — extracting contextual representations of genomic sequences (DNA). With longer masked language model pre-training, BigBird achieves state-of-the-art performance on downstream tasks, such as promoter-region prediction and chromatin profile prediction.

On multiple genomics tasks, such as promoter region prediction (PRP), chromatin-profile prediction including transcription factors (TF), histone-mark (HM) and DNase I hypersensitive (DHS) detection, we outperform baselines. Moreover, our results show that Transformer models can be applied to multiple genomics tasks that are currently underexplored.

Main Implementation Idea
One of the main impediments to the large scale adoption of sparse attention is the fact that sparse operations are quite inefficient in modern hardware. Behind both ETC and BigBird, one of our key innovations is to make an efficient implementation of the sparse attention mechanism. As modern hardware accelerators like GPUs and TPUs excel using coalesced memory operations, which load blocks of contiguous bytes at once, it is not efficient to have small sporadic look-ups caused by a sliding window (for local attention) or random element queries (random attention). Instead we transform the sparse local and random attention into dense tensor operations to take full advantage of modern single instruction, multiple data (SIMD) hardware.

To do this, we first “blockify” the attention mechanism to better leverage GPUs/TPUs, which are designed to operate on blocks. Then we convert the sparse attention mechanism computation into a dense tensor product through a series of simple matrix operations such as reshape, roll, and gather, as illustrated in the animation below.

Illustration of how sparse window attention is efficiently computed using roll and reshape, and without small sporadic look-ups.
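
As a rough sketch of that idea (not the released kernels), the sliding-window scores can be computed by stacking rolled copies of the key matrix and taking one batched dense product; boundary handling here simply wraps around, and the global and random parts are omitted:

import numpy as np

def windowed_attention_scores(q, k, w):
    # q, k: [seq_len, dim]. Returns scores of shape [seq_len, 2*w + 1], where
    # column c holds the score of query i against the key at offset c - w
    # (indices wrap at the sequence boundary for simplicity).
    shifted_keys = np.stack(
        [np.roll(k, shift=-offset, axis=0) for offset in range(-w, w + 1)],
        axis=1)                                       # [seq_len, 2*w + 1, dim]
    # One dense einsum instead of seq_len small, sporadic look-ups.
    return np.einsum('sd,swd->sw', q, shifted_keys)

q = np.random.randn(8, 4)
k = np.random.randn(8, 4)
print(windowed_attention_scores(q, k, w=2).shape)   # (8, 5)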

Recently, “Long Range Arena: A Benchmark for Efficient Transformers” provided a benchmark of six tasks that require longer context, and performed experiments to benchmark all existing long-range transformers. The results show that the BigBird model, unlike its counterparts, clearly reduces memory consumption without sacrificing performance.

Conclusion
We show that carefully designed sparse attention can be as expressive and flexible as the original full attention model. Along with theoretical guarantees, we provide a very efficient implementation which allows us to scale to much longer inputs. As a consequence, we achieve state-of-the-art results for question answering, document summarization and genome fragment classification. Given the generic nature of our sparse attention, the approach should be applicable to many other tasks like program synthesis and long form open domain question answering. We have open sourced the code for both ETC (github) and BigBird (github), both of which run efficiently for long sequences on both GPUs and TPUs.

Acknowledgements
This research resulted as a collaboration with Amr Ahmed, Joshua Ainslie, Chris Alberti, Vaclav Cvicek, Avinava Dubey, Zachary Fisher, Guru Guruganesh, Santiago Ontañón, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang, Manzil Zaheer, who co-authored EMNLP and NeurIPS papers.

sparklyr 1.6: weighted quantile summaries, power iteration clustering, spark_write_rds(), and more

Sparklyr 1.6 is now available on CRAN!

To install sparklyr 1.6 from CRAN, run

install.packages("sparklyr")

In this blog post, we shall highlight the following features and enhancements from sparklyr 1.6:

  • Weighted quantile summaries
  • Power iteration clustering
  • spark_write_rds() + collect_from_rds()
  • Dplyr-related improvements

Weighted quantile summaries

Apache Spark is well-known for supporting approximate algorithms that trade off marginal amounts of accuracy for greater speed and parallelism. Such algorithms are particularly beneficial for performing preliminary data explorations at scale, as they enable users to quickly query certain estimated statistics within a predefined error margin, while avoiding the high cost of exact computations. One example is the Greenwald-Khanna algorithm for on-line computation of quantile summaries, as described in Greenwald and Khanna (2001). This algorithm was originally designed for efficient \(\epsilon\)-approximation of quantiles within a large dataset without the notion of data points carrying different weights, and the unweighted version of it has been implemented as approxQuantile() since Spark 2.0. However, the same algorithm can be generalized to handle weighted inputs, and as sparklyr user @Zhuk66 mentioned in this issue, a weighted version of this algorithm makes for a useful sparklyr feature.

To properly explain what weighted-quantile means, we must clarify what the weight of each data point signifies. For example, if we have a sequence of observations \((1, 1, 1, 1, 0, 2, -1, -1)\) and would like to approximate the median of all data points, then we have the following two options:

  • Either run the unweighted version of approxQuantile() in Spark to scan through all 8 data points

  • Or alternatively, “compress” the data into 4 tuples of (value, weight): \((1, 0.5), (0, 0.125), (2, 0.125), (-1, 0.25)\), where the second component of each tuple represents how often a value occurs relative to the rest of the observed values, and then find the median by scanning through the 4 tuples using the weighted version of the Greenwald-Khanna algorithm (a quick worked check of what this means follows below)
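
As a quick worked check of what the weighted median means in this example (added here for illustration, not part of the original example), sort the tuples by value and accumulate their weights:

\[ (-1, 0.25) \to 0.25, \quad (0, 0.125) \to 0.375, \quad (1, 0.5) \to 0.875, \quad (2, 0.125) \to 1.0 \]

The smallest value whose cumulative weight reaches \(0.5\) is \(1\), which agrees with the median of the original eight observations.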

We can also run through a contrived example involving the standard normal distribution to illustrate the power of weighted quantile estimation in sparklyr 1.6. Suppose we cannot simply run qnorm() in R to evaluate the quantile function of the standard normal distribution at \(p = 0.25\) and \(p = 0.75\). How can we get some vague idea about the 1st and 3rd quartiles of this distribution? One way is to sample a large number of data points from this distribution, and then apply the Greenwald-Khanna algorithm to our unweighted samples, as shown below:

library(sparklyr)

sc <- spark_connect(master = "local")

num_samples <- 1e6
samples <- data.frame(x = rnorm(num_samples))

samples_sdf <- copy_to(sc, samples, name = random_string())

samples_sdf %>%
  sdf_quantile(
    column = "x",
    probabilities = c(0.25, 0.75),
    relative.error = 0.01
  ) %>%
  print()
##        25%        75%
## -0.6629242  0.6874939

Notice that because we are working with an approximate algorithm, and have specified relative.error = 0.01, the estimated value of \(-0.6629242\) from above could be anywhere between the 24th and the 26th percentile of all samples. In fact, it falls in the \(25.36896\)-th percentile:

pnorm(-0.6629242)
## [1] 0.2536896

Now how can we make use of weighted quantile estimation from sparklyr 1.6 to obtain similar results? Simple! We can sample a large number of \(x\) values uniformly at random from \((-\infty, \infty)\) (or alternatively, just select a large number of values evenly spaced between \((-M, M)\), where \(M\) is approximately \(\infty\)), and assign each \(x\) value a weight of \(\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\), the standard normal distribution’s probability density at \(x\). Finally, we run the weighted version of sdf_quantile() from sparklyr 1.6, as shown below:

library(sparklyr)

sc <- spark_connect(master = "local")

num_samples <- 1e6
M <- 1000
samples <- tibble::tibble(
  x = M * seq(-num_samples / 2 + 1, num_samples / 2) / num_samples,
  weight = dnorm(x)
)

samples_sdf <- copy_to(sc, samples, name = random_string())

samples_sdf %>%
  sdf_quantile(
    column = "x",
    weight.column = "weight",
    probabilities = c(0.25, 0.75),
    relative.error = 0.01
  ) %>%
  print()
##    25%    75%
## -0.696  0.662

Voilà! The estimates are not too far off from the 25th and 75th percentiles (in relation to our above-mentioned maximum permissible error of \(0.01\)):

pnorm(-0.696)
## [1] 0.2432144
pnorm(0.662)
## [1] 0.7460144

Power iteration clustering

Power iteration clustering (PIC), a simple and scalable graph clustering method presented in Lin and Cohen (2010), first finds a low-dimensional embedding of a dataset, using truncated power iteration on a normalized pairwise-similarity matrix of all data points, and then uses this embedding as the “cluster indicator”, an intermediate representation of the dataset that leads to fast convergence when used as input to k-means clustering. This process is very well illustrated in figure 1 of Lin and Cohen (2010) (reproduced below)

in which the leftmost image is the visualization of a dataset consisting of 3 circles, with points colored in red, green, and blue indicating clustering results, and the subsequent images show the power iteration process gradually transforming the original set of points into what appears to be three disjoint line segments, an intermediate representation that can be rapidly separated into 3 clusters using k-means clustering with \(k = 3\).
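
In symbols, the update that PIC iterates is short (a sketch of the procedure from Lin and Cohen (2010), writing \(A\) for the affinity matrix of pairwise similarities and \(D\) for the diagonal matrix of its row sums):

\[ W = D^{-1} A, \qquad v^{(t+1)} = \frac{W v^{(t)}}{\lVert W v^{(t)} \rVert_1}. \]

The iteration is stopped early (hence “truncated”), and the entries of the resulting vector \(v^{(t)}\) serve as the low-dimensional cluster indicator that is then handed to k-means.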

In sparklyr 1.6, ml_power_iteration() was implemented to make the PIC functionality in Spark accessible from R. It expects as input a 3-column Spark dataframe that represents a pairwise-similarity matrix of all data points. Two of the columns in this dataframe should contain 0-based row and column indices, and the third column should hold the corresponding similarity measure. In the example below, we will see a dataset consisting of two circles being easily separated into two clusters by ml_power_iteration(), with the Gaussian kernel being used as the similarity measure between any 2 points:

gen_similarity_matrix <- function() {
  # Gaussian similarity measure
  gaussian_similarity <- function(pt1, pt2) {
    exp(-sum((pt2 - pt1) ^ 2) / 2)
  }
  # generate evenly distributed points on a circle centered at the origin
  gen_circle <- function(radius, num_pts) {
    seq(0, num_pts - 1) %>%
      purrr::map_dfr(
        function(idx) {
          theta <- 2 * pi * idx / num_pts
          radius * c(x = cos(theta), y = sin(theta))
        })
  }
  # generate points on both circles
  pts <- rbind(
    gen_circle(radius = 1, num_pts = 80),
    gen_circle(radius = 4, num_pts = 80)
  )
  # populate the pairwise similarity matrix (stored as a 3-column dataframe)
  similarity_matrix <- data.frame()
  for (i in seq(2, nrow(pts)))
    similarity_matrix <- similarity_matrix %>%
      rbind(seq(i - 1L) %>%
        purrr::map_dfr(~ list(
          src = i - 1L, dst = .x - 1L,
          similarity = gaussian_similarity(pts[i,], pts[.x,])
        ))
      )

  similarity_matrix
}

library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- copy_to(sc, gen_similarity_matrix())
clusters <- ml_power_iteration(
  sdf, k = 2, max_iter = 10, init_mode = "degree",
  src_col = "src", dst_col = "dst", weight_col = "similarity"
)

clusters %>% print(n = 160)
## # A tibble: 160 x 2
##        id cluster
##     <dbl>   <int>
##   1     0       1
##   2     1       1
##   3     2       1
##   4     3       1
##   5     4       1
##   ...
##   157   156       0
##   158   157       0
##   159   158       0
##   160   159       0

The output shows points from the two circles being assigned to separate clusters, as expected, after only a small number of PIC iterations.

spark_write_rds() + collect_from_rds()

spark_write_rds() and collect_from_rds() are implemented as a less memory-consuming alternative to collect(). Unlike collect(), which retrieves all elements of a Spark dataframe through the Spark driver node, hence potentially causing slowness or out-of-memory failures when collecting large amounts of data, spark_write_rds(), when used in conjunction with collect_from_rds(), can retrieve all partitions of a Spark dataframe directly from Spark workers, rather than through the Spark driver node. First, spark_write_rds() will distribute the tasks of serializing Spark dataframe partitions in RDS version 2 format among Spark workers. Spark workers can then process multiple partitions in parallel, each handling one partition at a time and persisting the RDS output directly to disk, rather than sending dataframe partitions to the Spark driver node. Finally, the RDS outputs can be re-assembled to R dataframes using collect_from_rds().

Shown below is an example of spark_write_rds() + collect_from_rds() usage, where RDS outputs are first saved to HDFS, then downloaded to the local filesystem with hadoop fs -get, and finally, post-processed with collect_from_rds():

library(sparklyr)
library(nycflights13)

num_partitions <- 10L
sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark")
flights_sdf <- copy_to(sc, flights, repartition = num_partitions)

# Spark workers serialize all partitions in RDS format in parallel and write RDS
# outputs to HDFS
spark_write_rds(
  flights_sdf,
  dest_uri = "hdfs://<namenode>:8020/flights-part-{partitionId}.rds"
)

# Run `hadoop fs -get` to download RDS files from HDFS to local file system
for (partition in seq(num_partitions) - 1)
  system2(
    "hadoop",
    c("fs", "-get", sprintf("hdfs://<namenode>:8020/flights-part-%d.rds", partition))
  )

# Post-process RDS outputs
partitions <- (seq(num_partitions) - 1) %>%
  lapply(function(partition) collect_from_rds(sprintf("flights-part-%d.rds", partition)))

# Optionally, call `rbind()` to combine data from all partitions into a single R dataframe
flights_df <- do.call(rbind, partitions)

Dplyr-related improvements

Similar to other recent sparklyr releases, sparklyr 1.6 comes with a number of dplyr-related improvements, such as

  • Support for where() predicate within select() and summarize(across(…)) operations on Spark dataframes
  • Addition of if_all() and if_any() functions
  • Full compatibility with dbplyr 2.0 backend API

select(where(…)) and summarize(across(where(…)))

The dplyr where(…) construct is useful for applying a selection or aggregation function to multiple columns that satisfy some boolean predicate. For example,

library(dplyr)

iris %>% select(where(is.numeric))

returns all numeric columns from the iris dataset, and

library(dplyr)

iris %>% summarize(across(where(is.numeric), mean))

computes the average of each numeric column.

In sparklyr 1.6, both types of operations can be applied to Spark dataframes, e.g.,

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris, name = random_string())

iris_sdf %>% select(where(is.numeric))

iris_sdf %>% summarize(across(where(is.numeric), mean))

if_all() and if_any()

if_all() and if_any() are two convenience functions from dplyr 1.0.4 (see here for more details) that effectively[1] combine the results of applying a boolean predicate to a tidy selection of columns using the logical and/or operators.

Starting from sparklyr 1.6, if_all() and if_any() can also be applied to Spark dataframes, e.g.,

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris, name = random_string())

# Select all records with Petal.Width > 2 and Petal.Length > 2
iris_sdf %>% filter(if_all(starts_with("Petal"), ~ .x > 2))

# Select all records with Petal.Width > 5 or Petal.Length > 5
iris_sdf %>% filter(if_any(starts_with("Petal"), ~ .x > 5))

Compatibility with dbplyr 2.0 backend API

Sparklyr 1.6 is fully compatible with the newer dbplyr 2.0 backend API (by implementing all interface changes recommended here), while still maintaining backward compatibility with the previous edition of the dbplyr API, so that sparklyr users will not be forced to switch to any particular version of dbplyr.

This should be a mostly non-user-visible change as of now. In fact, the only discernible behavior change is that the following code

library(dbplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

print(dbplyr_edition(sc))

outputting

[1] 2

if sparklyr is working with dbplyr 2.0+, and

[1] 1

if otherwise.

Acknowledgements

In chronological order, we would like to thank the following contributors for making sparklyr 1.6 awesome:

We would also like to give a big shout-out to the wonderful open-source community behind sparklyr, without whom we would not have benefitted from numerous sparklyr-related bug reports and feature suggestions.

Finally, the author of this blog post also very much appreciates the highly valuable editorial suggestions from @skeydan.

If you wish to learn more about sparklyr, we recommend checking out sparklyr.ai, spark.rstudio.com, and also some previous sparklyr release posts such as sparklyr 1.5 and sparklyr 1.4.

That is all. Thanks for reading!

Greenwald, Michael, and Sanjeev Khanna. 2001. “Space-Efficient Online Computation of Quantile Summaries.” SIGMOD Rec. 30 (2): 58–66. https://doi.org/10.1145/376284.375670.

Lin, Frank, and William Cohen. 2010. “Power Iteration Clustering.” In Proceedings of the 27th International Conference on Machine Learning (ICML), 655–62.

  1. modulo possible implementation-dependent short-circuit evaluations

Recursive Classification: Replacing Rewards with Examples in RL

A general goal of robotics research is to design systems that can assist in a variety of tasks that can potentially improve daily life. Most reinforcement learning algorithms for teaching agents to perform new tasks require a reward function, which provides positive feedback to the agent for taking actions that lead to good outcomes. However, actually specifying these reward functions can be quite tedious and can be very difficult to define for situations without a clear objective, such as whether a room is clean or if a door is sufficiently shut. Even for tasks that are easy to describe, actually measuring whether the task has been solved can be difficult and may require adding many sensors to a robot’s environment.

Alternatively, training a model using examples, called example-based control, has the potential to overcome the limitations of approaches that rely on traditional reward functions. This new problem statement is most similar to prior methods based on “success detectors”, and efficient algorithms for example-based control could enable non-expert users to teach robots to perform new tasks, without the need for coding expertise, knowledge of reward function design, or the installation of environmental sensors.

In “Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification,” we propose a machine learning algorithm for teaching agents how to solve new tasks by providing examples of success (e.g., if “success” examples show a nail embedded into a wall, the agent will learn to pick up a hammer and knock nails into the wall). This algorithm, recursive classification of examples (RCE), does not rely on hand-crafted reward functions, distance functions, or features, but rather learns to solve tasks directly from data, requiring the agent to learn how to solve the entire task by itself, without requiring examples of any intermediate states. Using a version of temporal difference learning — similar to Q-learning, but replacing the typical reward function term using only examples of success — RCE outperforms prior approaches based on imitation learning on simulated robotics tasks. Coupled with theoretical guarantees similar to those for reward-based learning, the proposed method offers a user-friendly alternative for teaching robots new tasks.

Top: To teach a robot to hammer a nail into a wall, most reinforcement learning algorithms require that the user define a reward function. Bottom: The example-based control method uses examples of what the world looks like when a task is completed to teach the robot to solve the task, e.g., examples where the nail is already hammered into the wall.

Example-Based Control vs Imitation Learning
While the example-based control method is similar to imitation learning, there is an important distinction — it does not require expert demonstrations. In fact, the user can actually be quite bad at performing the task themselves, as long as they can look back and pick out the small fraction of states where they did happen to solve the task.

Additionally, whereas previous research used a stage-wise approach in which the model first uses success examples to learn a reward function and then applies that reward function with an off-the-shelf reinforcement learning algorithm, RCE learns directly from the examples and skips the intermediate step of defining the reward function. Doing so avoids potential bugs and bypasses the process of defining the hyperparameters associated with learning a reward function (such as how often to update the reward function or how to regularize it) and, when debugging, removes the need to examine code related to learning the reward function.

Recursive Classification of Examples
The intuition behind the RCE approach is simple: the model should predict whether the agent will solve the task in the future, given the current state of the world and the action that the agent is taking. If there were data that specified which state-action pairs lead to future success and which state-action pairs lead to future failure, then one could solve this problem using standard supervised learning. However, when the only data available consists of success examples, the system doesn’t know which states and actions led to success, and while the system also has experience interacting with the environment, this experience isn’t labeled as leading to success or not.

Left: The key idea is to learn a future success classifier that predicts for every state (circle) in a trajectory whether the task will be solved in the future (thumbs up/down). Right: In the example-based control approach, the model is provided only with unlabeled experience (grey circles) and success examples (green circles), so one cannot apply standard supervised learning. Instead, the model uses the success examples to automatically label the unlabeled experience.

Nonetheless, one can piece together what these data would look like, if it were available. First, by definition, a successful example must be one that solves the given task. Second, even though it is unknown whether an arbitrary state-action pair will lead to success in solving a task, it is possible to estimate how likely it is that the task will be solved if the agent started at the next state. If the next state is likely to lead to future success, it can be assumed that the current state is also likely to lead to future success. In effect, this is recursive classification, where the labels are inferred based on predictions at the next time step.

The underlying algorithmic idea of using a model’s predictions at a future time step as a label for the current time step closely resembles existing temporal-difference methods, such as Q-learning and successor features. The key difference is that the approach described here does not require a reward function. Nonetheless, we show that this method inherits many of the same theoretical convergence guarantees as temporal difference methods. In practice, implementing RCE requires changing only a few lines of code in an existing Q-learning implementation.
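
As a very rough, tabular illustration of that claim (a toy sketch of the recursive-classification idea described above, not the algorithm or loss from the paper), the snippet below trains a future-success classifier on a one-dimensional chain: success examples get label 1, unlabeled transitions borrow their labels from the classifier's own prediction at the next state, and no reward function appears anywhere:

import numpy as np

# Toy task: states 0..N-1 on a line; "success" means being at the last state.
N, GAMMA, LR = 10, 0.9, 0.1
ACTIONS = (-1, +1)                       # step left / step right
C = np.full((N, len(ACTIONS)), 0.5)      # C[s, a] ~ "will the task be solved later?"

def step(s, a_idx):
    return int(np.clip(s + ACTIONS[a_idx], 0, N - 1))

rng = np.random.default_rng(0)
for _ in range(5000):
    # Unlabeled experience: the label is bootstrapped from the prediction at the
    # next state (recursive classification); there is no reward term.
    s, a = rng.integers(N), rng.integers(len(ACTIONS))
    target = GAMMA * C[step(s, a)].max()
    C[s, a] += LR * (target - C[s, a])
    # Success examples: states where the task is already solved are labeled 1.
    C[N - 1, :] += LR * (1.0 - C[N - 1, :])

# The greedy policy with respect to C walks toward the success states.
print([int(C[s].argmax()) for s in range(N)])   # mostly 1s, i.e. "step right"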

Evaluation
We evaluated the RCE method on a range of challenging robotic manipulation tasks. For example, in one task we required a robotic hand to pick up a hammer and hit a nail into a board. Previous research into this task [1, 2] have used a complex reward function (with terms corresponding to the distance between the hand and the hammer, the distance between the hammer and the nail, and whether the nail has been knocked into the board). In contrast, the RCE method requires only a few observations of what the world would look like if the nail were hammered into the board.

We compared the performance of RCE to a number of prior methods, including those that learn an explicit reward function and those based on imitation learning, all of which struggle to solve this task. This experiment highlights how example-based control makes it easy for users to specify even complex tasks, and demonstrates that recursive classification can successfully solve these sorts of tasks.

Compared with prior methods, the RCE approach solves the task of hammering a nail into a board more reliably than prior approaches based on imitation learning [SQIL, DAC] and those that learn an explicit reward function [VICE, ORIL, PURL].

Conclusion
We have presented a method to teach autonomous agents to perform tasks by providing them with examples of success, rather than meticulously designing reward functions or collecting first-person demonstrations. An important aspect of example-based control, which we discuss in the paper, is what assumptions the system makes about the capabilities of different users. Designing variants of RCE that are robust to differences in users’ capabilities may be important for applications in real-world robotics. The code is available, and the project website contains additional videos of the learned behaviors.

Acknowledgements
We thank our co-authors, Ruslan Salakhutdinov and Sergey Levine. We also thank Surya Bhupatiraju, Kamyar Ghasemipour, Max Igl, and Harini Kannan for feedback on this post, and Tom Small for helping to design figures for this post.

Progress and Challenges in Long-Form Open-Domain Question Answering

Open-domain long-form question answering (LFQA) is a fundamental challenge in natural language processing (NLP) that involves retrieving documents relevant to a given question and using them to generate an elaborate paragraph-length answer. While there has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is enough to answer a question, much less work has been done in the area of long-form question answering. LFQA is nevertheless an important task, especially because it provides a testbed to measure the factuality of generative text models. But, are current benchmarks and evaluation metrics really suitable for making progress on LFQA?

In “Hurdles to Progress in Long-form Question Answering” (to appear at NAACL 2021), we present a new system for open-domain long-form question answering that leverages two recent advances in NLP: 1) state-of-the-art sparse attention models, such as Routing Transformer (RT), which allow attention-based models to scale to long sequences, and 2) retrieval-based models, such as REALM, which facilitate retrievals of Wikipedia articles related to a given query. To encourage more factual grounding, our system combines information from several retrieved Wikipedia articles related to the given question before generating an answer. It achieves a new state of the art on ELI5, the only large-scale publicly available dataset for long-form question answering.

However, while our system tops the public leaderboard, we discover several troubling trends with the ELI5 dataset and its associated evaluation metrics. In particular, we find 1) little evidence that models actually use the retrievals on which they condition; 2) that trivial baselines (e.g., input copying) beat modern systems, like RAG / BART+DPR; and 3) that there is a significant train/validation overlap in the dataset. Our paper suggests mitigation strategies for each of these issues.

Text Generation
The main workhorse of NLP models is the Transformer architecture, in which each token in a sequence attends to every other token in a sequence, resulting in a model that scales quadratically with sequence length. The RT model introduces a dynamic, content-based sparse attention mechanism that reduces the complexity of attention in the Transformer model from n^2 to n^1.5, where n is the sequence length, which enables it to scale to long sequences. This allows each word to attend to other relevant words anywhere in the entire piece of text, unlike methods such as Transformer-XL where a word can only attend to words in its immediate vicinity.

The key insight of the RT work is that each token attending to every other token is often redundant, and may be approximated by a combination of local and global attention. Local attention allows each token to build up a local representation over several layers of the model, where each token attends to a local neighborhood, facilitating local consistency and fluency. Complementing local attention, the RT model also uses mini-batch k-means clustering to enable each token to attend only to a set of most relevant tokens.

Attention maps for the content-based sparse attention mechanism used in Routing Transformer. The word sequence is represented by the diagonal dark colored squares. In the Transformer model (left), each token attends to every other token. The shaded squares represent the tokens in the sequence to which a given token (the dark square) is attending. The RT model uses both local attention (middle), where tokens attend only to other tokens in their local neighborhood, and routing attention (right), in which a token only attends to clusters of tokens most relevant to it in context. The dark red, green and blue tokens only attend to the corresponding color of lightly shaded tokens.
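
To make the routing half concrete, here is a toy sketch (an illustration for this write-up, using plain k-means rather than the mini-batch version and ignoring the complementary local attention): tokens are clustered by their current representations, and each token attends only within its own cluster.

import numpy as np

def routing_attention_mask(x, num_clusters, iters=10, seed=0):
    # x: [seq_len, dim] token representations. Returns a boolean [seq_len, seq_len]
    # mask; with roughly balanced clusters each token attends to about
    # seq_len / num_clusters others, so num_clusters ~ sqrt(seq_len) gives ~n^1.5 work.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=num_clusters, replace=False)]
    for _ in range(iters):                            # plain k-means
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(num_clusters):
            if (assign == c).any():
                centroids[c] = x[assign == c].mean(axis=0)
    return assign[:, None] == assign[None, :]         # attend within the same cluster

x = np.random.randn(64, 16)
mask = routing_attention_mask(x, num_clusters=8)
print(mask.sum(), "of", 64 * 64, "pairs scored")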

We pre-train an RT model on the Project Gutenberg (PG-19) dataset with a language modeling objective, i.e., the model learns to predict the next word given all the previous words, so as to be able to generate fluent, paragraph-long text.

Information Retrieval
To demonstrate the effectiveness of the RT model on the task of LFQA, we combine it with retrievals from REALM. The REALM model (Guu et al. 2020) is a retrieval-based model that uses the maximum inner product search to retrieve Wikipedia articles relevant to a particular query or question. The model was fine-tuned for factoid-based question answering on the Natural Questions dataset. REALM utilizes the BERT model to learn good representations for a question and uses SCANN to retrieve Wikipedia articles that have a high topical similarity with the question representation. This is then trained end-to-end to maximize the log-likelihood on the QA task.

We further improve the quality of REALM retrievals by using a contrastive loss. The idea behind this is to encourage the representation of a question to get close to its ground truth answer and diverge from the other answers in its mini-batch. This ensures that when the system retrieves relevant items using this question representation, it returns articles that are “similar” to ground truth answers. We call this retriever contrastive-REALM or c-REALM.
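
The sketch below illustrates the kind of in-batch contrastive objective described here (an illustration only; the exact c-REALM loss and retrieval setup follow the paper): each question representation is pushed toward its own ground-truth answer and away from the other answers in the mini-batch.

import numpy as np

def in_batch_contrastive_loss(q_emb, a_emb):
    # q_emb, a_emb: [batch, dim]; row i of a_emb is the positive answer for row i of q_emb.
    scores = q_emb @ a_emb.T                           # inner-product similarities
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # softmax cross-entropy on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
matched = q + 0.01 * rng.normal(size=(4, 8))           # answers close to their questions
print(in_batch_contrastive_loss(q, matched))           # typically low: positives score highest
print(in_batch_contrastive_loss(q, rng.normal(size=(4, 8))))  # typically much higher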

The combined RT + c-REALM system for LFQA.

Evaluation
We test the model on long-form question answering using the ELI5 dataset, which is a part of the KILT benchmark, and is the only publicly available large-scale LFQA dataset. The KILT benchmark measures text retrievals using Precision (R-Prec) and text generation using ROUGE-L. The two scores are combined to give a KILT R-L score, which determines a model’s ranking on the leaderboard. We fine-tune the pre-trained RT model together with retrievals from c-REALM on the ELI5 dataset from KILT.

Our submission tops the KILT leaderboard for long-form question answering on ELI5 with a combined KILT R-L score of 2.36. It improves on the previous leaderboard entry of BART + DPR (KILT R-L score of 1.9), while having a similar number of parameters as the other models on the leaderboard. In terms of text generation quality, we see an improvement of +4.11, +5.78 and +9.14 ROUGE-L over T5, BART + DPR and RAG, respectively.

Our results on the KILT leaderboard for long form question answering on ELI5.

Example Generations from the RT Model

  • Q: Why are almost all boats white?

    Boats are generally painted white for obvious reasons, as they are inexpensive and easy to clean. Ships are also often painted to provide a visible visual contrast to the sea. Boats also often have high visibility ports to allow for easy international trade. In addition, many boats have an air-breathing function that makes them easier to spot and stay afloat, which is a good thing if you’re a pirate or trying to get into some country’s waters. Lastly, many ships are built in the last few decades, so they are not as exotic looking as they were thirty years ago. And lastly, many ships are made out of materials that are easy to work with, and so many of them are painted white as a practical matter. There are many, many more reasons for why ships are white, but those are the ones I know of. I’m sure there are many other reasons, but those are the ones I know of. Edit: I forgot to mention that the other answers about visibility and easy to see ports are also true. Ships are painted to appear to be larger, and more expensive. That makes them appear more impressive, and so they are more likely to be bought by people who want to show off their wealth and status.

  • Q: What causes the trail behind jets at high altitude?

    The same thing that causes the clouds you see at the start of a storm to form in the first place. The air pressure changes as the jet moves through the air, causing it to cool down and condense into water droplets. These droplets are in fact a lot bigger than the water droplets in the clouds, so when the plane passes through them, they can cause the air pressure to change, causing the water droplets to condense into clouds. This is what you see when you look at a cloud formation from the ground. In addition to this, when the plane passes through the clouds, it also moves through the air, causing it to change its temperature, causing the water droplets to condense and form clouds. These clouds are often quite large, and can block out the sun. This is what you see when you look at a contrail. Finally, the water vapor that forms at high altitudes is cooled by the air above, and as it cools, it condenses into clouds. This is what you see when you look at a cumulonimbus cloud. These clouds are even larger than the water droplets in the clouds, and they block out the sun.

  • Q: How do shampoo and conditioner work?

    The key ingredients in shampoo and conditioner are usually surfactants, they tend to attract oil from your skin. Shampoo will usually contain an anti-dandruff agent (dandruff is caused by fungi in your skin which cause dandruff). They also contain some moisturizing agents (moisturizers are things that stay in your skin longer and protect your scalp from cleaning), and a lot of anti-dandruff (dandruff is a side effect of high doses of both surfactants and moisturizers) Conditioner will contain moisturizing agents, like a skin moisturizer. They tend to cause oils to be separated from your hair (e.g. wash your hair with conditioner instead of shampoo). They both have a foaming agent that makes them foam up (usually Hydrogen Peroxide). They also have a foaming agent that stops dirt from attaching to your hair. Some conditioner will also suppress the action of the shampoo (to help prevent dirt from attaching to your hair).

  • Q: When glancing at a clock, why does the first second after glancing at it sometimes feel longer than the rest?

    This is known as chronostasis. Your brain gets used to seeing the second hand move after the first, so it ignores the first one. If you were to look at the clock a lot, you’d see the second hand move a lot more than the first, and you’d be able to tell the difference between the second and the first. The second hand will appear to move much slower than the first hand because your brain is used to seeing the second hand move a lot more than the first. If you were to go on a long trip, you’d get used to the second hand moving a lot more than the first, and you’d have no problem telling the difference between the second and the first hand. It’s kind of like how you don’t notice your nose when you’re driving, but if you look at it for a while, you start to notice it. Your brain is just getting used to seeing the nose move a lot more than the other, and it can’t tell the difference between the two. It’s kind of like that with your eyes.

  • Q: Why do Britain and other English empire countries still bow to monarchs? What real purpose does the queen serve?

    Depends what you define as “purpose”. There is no specific purpose for the British monarch – rather, they are figureheads that serve a role. The role is to have a high degree of influence over a country, though the rules of the monarchy can somewhat vary from country to country, and the ceremony of “kissing the royal corpse” does not always involve kissing the monarch. Whether that’s the only reason for the custom, or if it was the only reason, I can’t say, but that’s the reasoning. When the U.K. was in full power, the monarch was a puppet of Parliament, and the powers of the monarch were transferred to the Prime Minister and thus to the Prime Minister’s deputy, who then became the Prime Minister. Since then, Parliament has been able to vote on legislation that goes through the monarch, although they may still act as the monarch’s representative in negotiating treaties, which can have very very deep consequences. The Queen’s role, as a representative of Britain, doesn’t necessarily involve her formal approval of any of the laws or legislation that goes through Parliament, though.

  • Q: What exactly is fire, in detail? How can light and heat come from something we can’t really touch?

    Fire is the chemical reaction of fuel (oxygen) with (what we call) impurities (ash, soot, oil etc). These impurities are created due to a number of things. If you were to study the chemical composition of a log (or a candle) and analyse it you would discover the chemical formula (A, C, and G) which is a simplified version of what actually happens. There is always fuel, and there is always the ash/soot. Wood is made up of mostly carbon and hydrogen and these are the fuels that create the fire. Hydrogen and oxygen are reactive, and they form water (H2O) and CO2 (CO2). If we split a log into smaller pieces, we start to see the process of combustion as it occurs. The carbon is burned, and with the explosion of CO, H2O and H2CO2, we get water and carbon dioxide (mostly). If we smash a log together, we start to see what happens when we also hit another log with another log…imagine smashing two sand castles together. We see how the sand castles and brick castles form and collapse at different rates…then it starts to burn. The smoke, ash and flames are the actual combustion of fuel.

  • Q: What goes on in those tall tower buildings owned by major banks?

    The actual buildings are not there for profit but for show. There are many real world historical buildings that have little to no effect other than being a symbol of an economic entity or symbol of an era or location, and nothing more. For example look at Sears, Sears didn’t care what went on inside, it was all about the _appearance_ of its location, the prestige of the location, the facilities and so on. It didn’t care about how long it took it to operate, it was about how much people would pay to go see it. Sears was a landmark as a cultural movement and other big companies followed suit, so if you want to see a building you’ve never seen before, you have to go see Sears, just like you have to see a Toyota Camry for Toyota Camry. They used to be all about building new factories, some of them if I recall, but now that they’re bigger, that means that more factory jobs are coming to them. You’ve probably seen them in stores as stores where people buy and sell stuff, so there aren’t that many places for them to come from. Instead, it’s just for show, a symbol of rich people.

Hurdles Towards Progress in LFQA
However, while the RT system described here tops the public leaderboard, a detailed analysis of the model and the ELI5 dataset reveals some concerning trends.

  • Many held-out questions are paraphrased in the training set. Simply returning the best answer to a similar training question already achieves 27.4 ROUGE-L.

  • Simply retrieving answers to random, unrelated training questions yields relatively high ROUGE-L, while the actual gold answers score lower than model generations.

  • Conditioning answer generation on random documents instead of relevant ones does not measurably impact factual correctness. Meanwhile, longer outputs get higher ROUGE-L.

We find little to no evidence that the model is actually grounding its text generation in the retrieved documents — fine-tuning an RT model with random retrievals from Wikipedia (i.e., random retrieval + RT) performs nearly as well as the c-REALM + RT model (24.2 vs 24.4 ROUGE-L). We also find significant overlap in the training, validation and test sets of ELI5 (with several questions being paraphrases of each other), which may eliminate the need for retrievals. The KILT benchmark measures the quality of retrievals and generations separately, without making sure that the text generation actually uses the retrievals.

Trivial baselines get higher Rouge-L scores than RAG and BART + DPR.

Moreover, we find issues with the Rouge-L metric used to evaluate the quality of text generation, with trivial nonsensical baselines, such as a Random Training Set answer and Input Copying, achieving relatively high Rouge-L scores (even beating BART + DPR and RAG).
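
To make this failure mode concrete, here is a toy sketch of what ROUGE-L measures: an F-measure over the longest common subsequence (LCS) of candidate and reference tokens. This is an illustrative simplification written in R (the framework used by the code elsewhere in this collection), not the official KILT scorer, but it shows why the metric rewards surface overlap rather than factual correctness.

lcs_length <- function(a, b) {
  # classic dynamic-programming LCS over two token vectors
  m <- length(a); n <- length(b)
  dp <- matrix(0L, nrow = m + 1, ncol = n + 1)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      dp[i + 1, j + 1] <- if (a[i] == b[j]) dp[i, j] + 1L else max(dp[i, j + 1], dp[i + 1, j])
    }
  }
  dp[m + 1, n + 1]
}

rouge_l <- function(candidate, reference) {
  c_tok <- strsplit(tolower(candidate), "\\s+")[[1]]
  r_tok <- strsplit(tolower(reference), "\\s+")[[1]]
  lcs <- lcs_length(c_tok, r_tok)
  precision <- lcs / length(c_tok)
  recall    <- lcs / length(r_tok)
  if (precision + recall == 0) return(0)
  2 * precision * recall / (precision + recall)
}

# higher scores simply reflect more in-order token overlap, not factual correctness
rouge_l("water vapor from the engine exhaust condenses in the cold air",
        "the trail is water vapor from the exhaust condensing in the cold air at high altitude")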

Conclusion
We proposed a system for long-form question answering based on Routing Transformers and REALM, which tops the KILT leaderboard on ELI5. However, a detailed analysis reveals several issues with the benchmark that preclude using it to inform meaningful modelling advances. We hope that the community works together to solve these issues so that researchers can climb the right hills and make meaningful progress in this challenging but important task.

Acknowledgments
The Routing Transformer work has been a team effort involving Aurko Roy, Mohammad Saffar, Ashish Vaswani and David Grangier. The follow-up work on open-domain long-form question answering has been a collaboration involving Kalpesh Krishna, Aurko Roy and Mohit Iyyer. We wish to thank Vidhisha Balachandran, Niki Parmar and Ashish Vaswani for several helpful discussions, and the REALM team (Kenton Lee, Kelvin Guu, Ming-Wei Chang and Zora Tung) for help with their codebase and several useful discussions, which helped us improve our experiments. We are grateful to Tu Vu for help with the QQP classifier used to detect paraphrases in ELI5 train and test sets. We thank Jules Gagnon-Marchand and Sewon Min for suggesting useful experiments on checking ROUGE-L bounds. Finally we thank Shufan Wang, Andrew Drozdov, Nader Akoury and the rest of the UMass NLP group for helpful discussions and suggestions at various stages in the project.

Categories
Offsites

Leveraging Machine Learning for Game Development

Over the years, online multiplayer games have exploded in popularity, captivating millions of players across the world. This popularity has also exponentially increased demands on game designers, as players expect games to be well-crafted and balanced — after all, it’s no fun to play a game where a single strategy beats all the rest.

In order to create a positive gameplay experience, game designers typically tune the balance of a game iteratively:

  1. Stress-test through thousands of play-testing sessions from test users
  2. Incorporate feedback and re-design the game
  3. Repeat 1 & 2 until both the play-testers and game designers are satisfied

This process is not only time-consuming but also imperfect — the more complex the game, the easier it is for subtle flaws to slip through the cracks. When a game has many different roles that can be played, each with dozens of interconnecting skills, it becomes all the more difficult to hit the right balance.

Today, we present an approach that leverages machine learning (ML) to adjust game balance by training models to serve as play-testers, and demonstrate this approach on the digital card game prototype Chimera, which we’ve previously shown as a testbed for ML-generated art. By running millions of simulations using trained agents to collect data, this ML-based game testing approach enables game designers to more efficiently make a game more fun, balanced, and aligned with their original vision.

Chimera
We developed Chimera as a game prototype that would heavily lean on machine learning during its development process. For the game itself, we purposefully designed the rules to expand the possibility space, making it difficult to build a traditional hand-crafted AI to play the game.

The gameplay of Chimera revolves around the titular chimeras, creature mash-ups that players aim to strengthen and evolve. The objective of the game is to defeat the opponent’s chimera. These are the key points in the game design:

  • Players may play:
    • creatures, which can attack (through their attack stat) or be attacked (against their health stat), or
    • spells, which produce special effects.
  • Creatures are summoned into limited-capacity biomes, which are placed physically on the board space. Each creature has a preferred biome and will take repeated damage if placed on an incorrect biome or a biome that is over capacity.
  • A player controls a single chimera, which starts off in a basic “egg” state and can be evolved and strengthened by absorbing creatures. To do this, the player must also acquire a certain amount of link energy, which is generated from various gameplay mechanics.
  • The game ends when a player has successfully brought the health of the opponent’s chimera to 0.

Learning to Play Chimera
As an imperfect information card game with a large state space, we expected Chimera to be a difficult game for an ML model to learn, especially as we were aiming for a relatively simple model. We used an approach inspired by earlier game-playing agents like AlphaGo, in which a convolutional neural network (CNN) is trained to predict the probability of a win when given an arbitrary game state. After training an initial model on games where random moves were chosen, we set the agent to play against itself, iteratively collecting game data that was then used to train a new agent. With each iteration, the quality of the training data improved, as did the agent’s ability to play the game.

The ML agent’s performance against our best hand-crafted AI as training progressed. The initial ML agent (version 0) picked moves randomly.

For the actual game state representation that the model would receive as input, we found that passing an “image” encoding to the CNN resulted in the best performance, beating all benchmark procedural agents and other types of networks (e.g. fully connected). The chosen model architecture is small enough to run on a CPU in reasonable time, which allowed us to download the model weights and run the agent live in a Chimera game client using Unity Barracuda.

An example game state representation used to train the neural network.
In addition to making decisions for the game AI, we also used the model to display the estimated win probability for a player over the course of the game.
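
The sketch below is purely illustrative: it uses made-up dimensions and R torch (the framework used by the code elsewhere in this collection) rather than the actual Chimera training stack, but it shows the shape of a small CNN that maps an “image”-encoded game state to a win probability.

library(torch)

# hypothetical encoding: one channel, a 16 x 16 grid summarizing the board state
win_prob_net <- nn_module(
  initialize = function() {
    self$conv1 <- nn_conv2d(in_channels = 1, out_channels = 16, kernel_size = 3, padding = 1)
    self$conv2 <- nn_conv2d(in_channels = 16, out_channels = 32, kernel_size = 3, padding = 1)
    self$fc <- nn_linear(32 * 16 * 16, 1)
  },
  forward = function(x) {
    x <- nnf_relu(self$conv1(x))
    x <- nnf_relu(self$conv2(x))
    x <- torch_flatten(x, start_dim = 2)
    torch_sigmoid(self$fc(x))          # estimated probability of winning
  }
)

net <- win_prob_net()
states <- torch_randn(8, 1, 16, 16)    # a batch of encoded game states (stand-in data)
won    <- torch_rand(8, 1)$round()     # 1 if the acting player eventually won that game
loss   <- nnf_binary_cross_entropy(net(states), won)

In the self-play loop described above, batches like states and won would come from games played by the previous iteration’s agent, and the freshly trained network would in turn drive the next round of games.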

Balancing Chimera
This approach enabled us to simulate millions more games than real players would be capable of playing in the same time span. After collecting data from the games played by the best-performing agents, we analyzed the results to find imbalances between two of the player decks we had designed.

First, the Evasion Link Gen deck was composed of spells and creatures with abilities that generated extra link energy used to evolve a player’s chimera. It also contained spells that enabled creatures to evade attacks. In contrast, the Damage-Heal deck contained creatures of variable strength with spells that focused on healing and inflicting minor damage. Although we had designed these decks to be of equal strength, the Evasion Link Gen deck was winning 60% of the time when played against the Damage-Heal deck.

When we collected various stats related to biomes, creatures, spells, and chimera evolutions, two things immediately jumped out at us:

  1. There was a clear advantage in evolving a chimera — the agent won a majority of the games where it evolved its chimera more than the opponent did. Yet, the average number of evolves per game did not meet our expectations. To make it more of a core game mechanic, we wanted to increase the overall average number of evolves while keeping its usage strategic.
  2. The T-Rex creature was overpowered. Its appearances correlated strongly with wins, and the model would always play the T-Rex regardless of penalties for summoning into an incorrect or overcrowded biome.

From these insights, we made some adjustments to the game. To emphasize chimera evolution as a core mechanism in the game, we decreased the amount of link energy required to evolve a chimera from 3 to 1. We also added a “cool-off” period to the T-Rex creature, doubling the time it took to recover from any of its actions.

Repeating our ‘self-play’ training procedure with the updated rules, we observed that these changes pushed the game in the desired direction — the average number of evolves per game increased, and the T-Rex’s dominance faded.

One example comparison of the T-Rex’s influence before and after balancing. The charts present the number of games won (or lost) when a deck initiates a particular spell interaction (e.g., using the “Dodge” spell to benefit a T-Rex). Left: Before the changes, the T-Rex had a strong influence in every metric examined — highest survival rate, most likely to be summoned ignoring penalties, most absorbed creature during wins. Right: After the changes, the T-Rex was much less overpowered.

By weakening the T-Rex, we successfully reduced the Evasion Link Gen deck’s reliance on an overpowered creature. Even so, the win ratio between the decks remained at 60/40 rather than 50/50. A closer look at the individual game logs revealed that the gameplay was often less strategic than we would have liked. Searching through our gathered data again, we found several more areas to introduce changes in.

To start, we increased the starting health of both players as well as the amount of health that healing spells could replenish. This was to encourage longer games that would allow a more diverse set of strategies to flourish. In particular, this enabled the Damage-Heal deck to survive long enough to take advantage of its healing strategy. To encourage proper summoning and strategic biome placement, we increased the existing penalties on playing creatures into incorrect or overcrowded biomes. And finally, we decreased the gap between the strongest and weakest creatures through minor attribute adjustments.

New adjustments in place, we arrived at the final game balance stats for these two decks:

Deck                 Avg # evolves per game     Win % (1M games)
                     (before → after)           (before → after)
Evasion Link Gen     1.54 → 2.16                59.1% → 49.8%
Damage Heal          0.86 → 1.76                40.9% → 50.2%

Conclusion
Normally, identifying imbalances in a newly prototyped game can take months of playtesting. With this approach, we were able to not only discover potential imbalances but also introduce tweaks to mitigate them in a span of days. We found that a relatively simple neural network was sufficient to reach high level performance against humans and traditional game AI. These agents could be leveraged in further ways, such as for coaching new players or discovering unexpected strategies. We hope this work will inspire more exploration in the possibilities of machine learning for game development.

Acknowledgements
This project was conducted in collaboration with many people. Thanks to Ryan Poplin, Maxwell Hannaman, Taylor Steil, Adam Prins, Michal Todorovic, Xuefan Zhou, Aaron Cammarata, Andeep Toor, Trung Le, Erin Hoffman-John, and Colin Boswell. Thanks to everyone who contributed through playtesting, advising on game design, and giving valuable feedback.

Categories
Offsites

torch time series, final episode: Attention

This is the final post in a four-part introduction to time-series forecasting with torch. These posts have been the story of a quest for multiple-step prediction, and by now, we’ve seen three different approaches: forecasting in a loop, incorporating a multi-layer perceptron (MLP), and sequence-to-sequence models. Here’s a quick recap.

  • As one should when one sets out for an adventurous journey, we started with an in-depth study of the tools at our disposal: recurrent neural networks (RNNs). We trained a model to predict the very next observation in line, and then thought of a clever hack: How about we use this for multi-step prediction, feeding back individual predictions in a loop? The result, it turned out, was quite acceptable.

  • Then, the adventure really started. We built our first model “natively” for multi-step prediction, relieving the RNN a bit of its workload and involving a second player, a tiny-ish MLP. Now, it was the MLP’s task to project RNN output to several time points in the future. Although results were pretty satisfactory, we didn’t stop there.

  • Instead, we applied to numerical time series a technique commonly used in natural language processing (NLP): sequence-to-sequence (seq2seq) prediction. While forecast performance was not much different from the previous case, we found the technique to be more intuitively appealing, since it reflects the causal relationship between successive forecasts.

Today we’ll enrich the seq2seq approach by adding a new component: the attention module. Originally introduced around 2014 (see footnote 1), attention mechanisms have gained enormous traction, so much so that a recent paper title starts out “Attention is Not All You Need” (see footnote 2).

The idea is the following.

In the classic encoder-decoder setup, the decoder gets “primed” with an encoder summary just a single time: the time it starts its forecasting loop. From then on, it’s on its own. With attention, however, it gets to see the complete sequence of encoder outputs again every time it forecasts a new value. What’s more, every time, it gets to zoom in on those outputs that seem relevant for the current prediction step.

This is a particularly useful strategy in translation: In generating the next word, a model will need to know what part of the source sentence to focus on. How much the technique helps with numerical sequences, in contrast, will likely depend on the features of the series in question.

Data input

As before, we work with vic_elec, but this time, we partly deviate from the way we used to employ it. With the original, half-hourly dataset, training the current model takes a long time, longer than readers will want to wait when experimenting. So instead, we aggregate observations by day. In order to have enough data, we train on years 2012 and 2013, reserving 2014 for validation as well as post-training inspection.

vic_elec_daily <- vic_elec %>%
  select(Time, Demand) %>%
  index_by(Date = date(Time)) %>%
  summarise(
    Demand = sum(Demand) / 1e3) 

elec_train <- vic_elec_daily %>% 
  filter(year(Date) %in% c(2012, 2013)) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

elec_valid <- vic_elec_daily %>% 
  filter(year(Date) == 2014) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

elec_test <- vic_elec_daily %>% 
  filter(year(Date) %in% c(2014), month(Date) %in% 1:4) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

We’ll attempt to forecast demand up to fourteen days ahead. How long, then, should the input sequences be? This is a matter of experimentation; all the more so now that we’re adding in the attention mechanism. (I suspect that it might not handle very long sequences so well.)

Below, we go with fourteen days for input length, too, but that may not necessarily be the best possible choice for this series.

n_timesteps <- 7 * 2
n_forecast <- 7 * 2

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    n <- length(self$x) - self$n_timesteps - 1
    
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
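
As a quick sanity check (not part of the training pipeline, just a sketch that can be handy while experimenting), we can pull a single batch and confirm the shapes the model will see:

b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()

dim(b$x)   # 32 14 1   (batch_size, n_timesteps, features)
dim(b$y)   # 32 14     (batch_size, n_timesteps)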

Model

Model-wise, we again encounter the three modules familiar from the previous post: encoder, decoder, and top-level seq2seq module. However, there is an additional component: the attention module, used by the decoder to obtain attention weights.

Encoder

The encoder still works the same way. It wraps an RNN, and returns the final state.

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    x <- self$rnn(x)
    
    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
    
  }
)

Attention module

In basic seq2seq, whenever it had to generate a new value, the decoder took into account two things: its prior state, and the previous output generated. In an attention-enriched setup, the decoder additionally receives the complete output from the encoder. In deciding what subset of that output should matter, it gets help from a new agent, the attention module.

This, then, is the attention module’s raison d’être: Given the current decoder state as well as the complete encoder outputs, obtain a weighting of those outputs indicative of how relevant they are to what the decoder is currently up to. This procedure results in the so-called attention weights: a normalized score for each time step in the encoding, quantifying its respective importance.

Attention may be implemented in a number of different ways. Here, we show two implementation options, one additive, and one multiplicative.

Additive attention

In additive attention, encoder outputs and decoder state are commonly either added or concatenated (we choose to do the latter, below). The resulting tensor is run through a linear layer, and a softmax is applied for normalization.

attention_module_additive <- nn_module(
  
  initialize = function(hidden_dim, attention_size) {
    
    self$attention <- nn_linear(2 * hidden_dim, attention_size)
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)
    
    # multiplex state to allow for concatenation (dimensions 1 and 2 must agree)
    seq_len <- dim(encoder_outputs)[2]
    # resulting shape: (bs, timesteps, hidden_dim)
    state_rep <- state$permute(c(2, 1, 3))$repeat_interleave(seq_len, 2)
    
    # concatenate along feature dimension
    concat <- torch_cat(list(state_rep, encoder_outputs), dim = 3)
    
    # run through linear layer with tanh
    # resulting shape: (bs, timesteps, attention_size)
    scores <- self$attention(concat) %>% 
      torch_tanh()
    
    # sum over attention dimension and normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores %>%
      torch_sum(dim = 3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)

Multiplicative attention

In multiplicative attention, scores are obtained by computing dot products between decoder state and all of the encoder outputs. Here too, a softmax is then used for normalization.

attention_module_multiplicative <- nn_module(
  
  initialize = function() {
    
    NULL
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)

    # allow for matrix multiplication with encoder_outputs
    state <- state$permute(c(2, 3, 1))
 
    # prepare for scaling by number of features
    d <- torch_tensor(dim(encoder_outputs)[3], dtype = torch_float())
       
    # scaled dot products between state and outputs
    # resulting shape: (bs, timesteps, 1)
    scores <- torch_bmm(encoder_outputs, state) %>%
      torch_div(torch_sqrt(d))
    
    # normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores$squeeze(3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)
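
Before wiring this into the decoder, it can be reassuring to check the module’s output shape on dummy tensors (the dimensions below are arbitrary):

enc_outputs <- torch_randn(4, 14, 32)   # (bs, timesteps, hidden_dim)
state       <- torch_randn(1, 4, 32)    # (1, bs, hidden_dim)

attn <- attention_module_multiplicative()
weights <- attn(state, enc_outputs)

dim(weights)          # 4 14
weights$sum(dim = 2)  # every row sums to 1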

Decoder

Once attention weights have been computed, their actual application is handled by the decoder. Concretely, the method in question, weighted_encoder_outputs(), computes a product of weights and encoder outputs, making sure that each output will have appropriate impact.

The rest of the action then happens in forward(). A concatenation of weighted encoder outputs (often called “context”) and current input is run through an RNN. Then, an ensemble of RNN output, context, and input is passed to an MLP. Finally, both RNN state and current prediction are returned.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size = 8, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    self$linear <- nn_linear(2 * hidden_size + 1, 1)
    
    self$attention <- if (attention_type == "multiplicative") attention_module_multiplicative()
      else attention_module_additive(hidden_size, attention_size)
    
  },
  
  weighted_encoder_outputs = function(state, encoder_outputs) {

    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    # resulting shape: (bs, timesteps)
    attention_weights <- self$attention(state, encoder_outputs)
    
    # resulting shape: (bs, 1, seq_len)
    attention_weights <- attention_weights$unsqueeze(2)
    
    # resulting shape: (bs, 1, hidden_size)
    weighted_encoder_outputs <- torch_bmm(attention_weights, encoder_outputs)
    
    weighted_encoder_outputs
    
  },
  
  forward = function(x, state, encoder_outputs) {
 
    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    
    # resulting shape: (bs, 1, hidden_size)
    context <- self$weighted_encoder_outputs(state, encoder_outputs)
    
    # concatenate input and context
    # NOTE: this repeating is done to compensate for the absence of an embedding module
    # that, in NLP, would give x a higher proportion in the concatenation
    x_rep <- x$repeat_interleave(dim(context)[3], 3) 
    rnn_input <- torch_cat(list(x_rep, context), dim = 3)
    
    # resulting shapes: (bs, 1, hidden_size) and (1, bs, hidden_size)
    rnn_out <- self$rnn(rnn_input, state)
    rnn_output <- rnn_out[[1]]
    next_hidden <- rnn_out[[2]]
    
    mlp_input <- torch_cat(list(rnn_output$squeeze(2), context$squeeze(2), x$squeeze(2)), dim = 2)
    
    output <- self$linear(mlp_input)
    
    # shapes: (bs, 1) and (1, bs, hidden_size)
    list(output, next_hidden)
  }
  
)

seq2seq module

The seq2seq module is basically unchanged (apart from the fact that now, it allows for attention module configuration). For a detailed explanation of what happens here, please consult the previous post.

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size, n_forecast, 
                        num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size, hidden_size = hidden_size,
                                   num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = 2 * hidden_size, hidden_size = hidden_size,
                                   attention_type = attention_type, attention_size = attention_size, num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)
    encoded <- self$encoder(x)
    encoder_outputs <- encoded[[1]]
    hidden <- encoded[[2]]
    # list of (batch_size, 1), (1, batch_size, hidden_size)
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden, encoder_outputs)
    # (batch_size, 1)
    pred <- out[[1]]
    # (1, batch_size, hidden_size)
    state <- out[[2]]
    outputs[ , 1] <- pred$squeeze(2)
    
    for (t in 2:self$n_forecast) {
      
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE]$unsqueeze(3) else pred$unsqueeze(3)
      out <- self$decoder(input, state, encoder_outputs)
      pred <- out[[1]]
      state <- out[[2]]
      outputs[ , t] <- pred$squeeze(2)
      
    }
    
    outputs
  }
  
)

When instantiating the top-level model, we now have an additional choice: that between additive and multiplicative attention. In the “accuracy” sense of performance, my tests did not show any differences. However, the multiplicative variant is a lot faster.

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, attention_type = "multiplicative",
                      attention_size = 8, n_forecast = n_timesteps)

# training RNNs on the GPU currently prints a warning that may clutter 
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use 
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)

Training

Just like last time, in model training, we get to choose the degree of teacher forcing. Below, we go with a fraction of 0.0, that is, no forcing at all.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 100

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.0)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, training: loss: %3.5f n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, validation: loss: %3.5f n", epoch, mean(valid_loss)))
}
# Epoch 1, training: loss: 0.83752 
# Epoch 1, validation: loss: 0.83167

# Epoch 2, training: loss: 0.72803 
# Epoch 2, validation: loss: 0.80804 

# ...
# ...

# Epoch 99, training: loss: 0.10385 
# Epoch 99, validation: loss: 0.21259 

# Epoch 100, training: loss: 0.10396 
# Epoch 100, validation: loss: 0.20975 

Evaluation

For visual inspection, we pick a few forecasts from the test set.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

vic_elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4)


coro::loop(for (b in test_dl) {
  
  input <- b$x
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_test) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[21]]
test_pred2 <- c(rep(NA, n_timesteps + 20), test_pred2, rep(NA, nrow(vic_elec_test) - 20 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[41]]
test_pred3 <- c(rep(NA, n_timesteps + 40), test_pred3, rep(NA, nrow(vic_elec_test) - 40 - n_timesteps - n_forecast))

test_pred4 <- test_preds[[61]]
test_pred4 <- c(rep(NA, n_timesteps + 60), test_pred4, rep(NA, nrow(vic_elec_test) - 60 - n_timesteps - n_forecast))

test_pred5 <- test_preds[[81]]
test_pred5 <- c(rep(NA, n_timesteps + 80), test_pred5, rep(NA, nrow(vic_elec_test) - 80 - n_timesteps - n_forecast))


preds_ts <- vic_elec_test %>%
  select(Demand, Date) %>%
  add_column(
    ex_1 = test_pred1 * train_sd + train_mean,
    ex_2 = test_pred2 * train_sd + train_mean,
    ex_3 = test_pred3 * train_sd + train_mean,
    ex_4 = test_pred4 * train_sd + train_mean,
    ex_5 = test_pred5 * train_sd + train_mean) %>%
  pivot_longer(-Date) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_color_hue(h = c(80, 300), l = 70) +
  theme_minimal()
A sample of two-weeks-ahead predictions for the test set, 2014.

We can’t directly compare performance here to that of previous models in our series, as we’ve pragmatically redefined the task. The main goal, however, has been to introduce the concept of attention. Specifically, how to manually implement the technique – something that, once you’ve understood the concept, you may never have to do in practice. Instead, you would likely make use of existing tools that come with torch (multi-head attention and transformer modules), tools we may introduce in a future “season” of this series.

Thanks for reading!

Photo by David Clode on Unsplash

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.” arXiv E-Prints, March, arXiv:2103.03404. http://arxiv.org/abs/2103.03404.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv E-Prints, June, arXiv:1706.03762. http://arxiv.org/abs/1706.03762.

Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. “Grammar as a Foreign Language.” CoRR abs/1412.7449. http://arxiv.org/abs/1412.7449.

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” CoRR abs/1502.03044. http://arxiv.org/abs/1502.03044.

  1. For important papers see: Bahdanau, Cho, and Bengio (2014), Vinyals et al. (2014), Xu et al. (2015).

  2. Dong, Cordonnier, and Loukas (2021), thus replying to Vaswani et al. (2017), whose 2017 paper “went viral” for stating the opposite.

Categories
Offsites

Massively Parallel Graph Computation: From Theory to Practice

Graphs are useful theoretical representations of the connections between groups of entities, and have been used for a variety of purposes in data science, from ranking web pages by popularity and mapping out social networks, to assisting with navigation. In many cases, such applications require the processing of graphs containing hundreds of billions of edges, which is far too large to be processed on a single consumer-grade machine. A typical approach to scaling graph algorithms is to run in a distributed setting, i.e., to partition the data (and the algorithm) among multiple computers to perform the computation in parallel. While this approach allows one to process graphs with trillions of edges, it also introduces new challenges. Namely, because each computer only sees a small piece of the input graph at a time, one needs to handle inter-machine communication and design algorithms that can be split across multiple computers.

A framework for implementing distributed algorithms, MapReduce, was introduced in 2008. It transparently handled communication between machines while offering good fault-tolerance capabilities and inspired the development of a number of distributed computation frameworks, including Pregel, Apache Hadoop, and many others. Still, the challenge of developing algorithms for distributed computation on very large graphs remained, and designing efficient algorithms in this context even for basic problems, such as connected components, maximum matching or shortest paths, has been an active area of research. While recent work has demonstrated new algorithms for many problems, including our algorithms for connected components (both in theory and practice) and hierarchical clustering, there was still a need for methods that could solve a range of problems more quickly.

Today we present a pair of recent papers that address this problem by first constructing a theoretical model for distributed graph algorithms and then demonstrating how the model can be applied. The proposed model, Adaptive Massively Parallel Computation (AMPC), augments the theoretical capabilities of MapReduce, providing a pathway to solve many graph problems in fewer computation rounds. We also show how the AMPC model can be effectively implemented in practice. The suite of algorithms we describe, which includes algorithms for maximal independent set, maximum matching, connected components and minimum spanning tree, works up to 7x faster than current state-of-the-art approaches.

Limitations of MapReduce
In order to understand the limitations of MapReduce for developing graph algorithms, consider a simplified variant of the connected components problem. The input is a collection of rooted trees, and the goal is to compute, for each node, the root of its tree. Even this seemingly simple problem is not easy to solve in MapReduce. In fact, in the Massively Parallel Computation (MPC) model — the theoretical model behind MapReduce, Pregel, Apache Giraph and many other distributed computation frameworks — this problem is widely believed to require at least a number of rounds of computation proportional to log n, where n is the total number of nodes in the graph. While log n may not seem to be a large number, algorithms processing trillion-edge graphs often write hundreds of terabytes of data to disk in each round, and thus even a small reduction in the number of rounds may bring significant resource savings.

The problem of finding root nodes. Nodes are represented by blue circles. Gray arrows point from each node to its parent. The root nodes are the nodes with no parents. The orange arrows illustrate the path an algorithm would follow from a node to the root of the tree to which it belongs.

A similar subproblem showed up in our algorithms for finding connected components and computing a hierarchical clustering. We observed that one can bypass the limitations of MapReduce by implementing these algorithms through the use of a distributed hash table (DHT), a service that is initialized with a collection of key-value pairs and then returns a value associated with a provided key in real-time. In our implementation, for each node, the DHT stores its parent node. Then, a machine that processes a graph node can use the DHT and “walk up” the tree until it reaches the root. While the use of a DHT worked well for this particular problem (although it relied on the input trees being not too deep), it was unclear if the idea could be applied more broadly.
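
To see the idea on a toy example, here is a single-machine sketch of that “walk up via the DHT” step, with an ordinary named vector standing in for the distributed key-value store (the node names and forest below are made up):

# each node maps to its parent; roots map to themselves
parent <- c(a = "a", b = "a", c = "b", d = "a", e = "e", f = "e")

find_root <- function(node) {
  while (parent[[node]] != node) node <- parent[[node]]
  node
}

sapply(names(parent), find_root)
#   a   b   c   d   e   f
# "a" "a" "a" "a" "e" "e"

In the actual system, each lookup of parent[[node]] is a real-time read from the DHT, so a machine can chase pointers all the way to the root within a single round, rather than advancing one hop per round of computation.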

The Adaptive Massively Parallel Computation Model
To extend this approach to other problems, we started by developing a model to theoretically analyze algorithms that utilize a DHT. The resulting AMPC model builds upon the well-established MPC model and formally describes the capabilities brought by the use of a distributed hash table.

In the MPC model there is a collection of machines, which communicate via message passing in synchronous rounds. Messages sent in one round are delivered in the beginning of the following round and constitute that round’s entire input (i.e., the machines do not retain information from one round to the next). In the first round, one can assume that the input is randomly distributed across the machines. The goal is to minimize the number of computation rounds, while assuring load-balancing between machines in each round.

Computation in the MPC model. Each column represents one machine in subsequent computation rounds. Once all machines have completed a round of computation, all messages sent in that round are delivered, and the following round begins.

We then formalized the AMPC model by introducing a new approach, in which machines write to a write-only distributed hash table each round, instead of communicating via messages. Once a new round starts, the hash table from the previous round becomes read-only and a new write-only output hash table becomes available. What is important is that only the method of communication changes — the amount of communication and available space per machine is constrained exactly in the same way as in the MPC model. Hence, at a high level the added capability of the AMPC model is that each machine can choose what data to read, instead of being provided a piece of data.

Computation in the AMPC model. Once all machines have completed a round of computation, the data they produced is saved to a distributed hash table. In the following round, each machine can read arbitrary values from this distributed hash table and write to a new distributed hash table.

Algorithms and Empirical Evaluation
This seemingly small difference in the way machines communicate allowed us to design much faster algorithms for a number of basic graph problems. In particular, we show that it is possible to find connected components, minimum spanning tree, maximal matching and maximal independent set in a constant number of rounds, regardless of the size of the graph.

To investigate the practical applicability of the AMPC algorithms, we have instantiated the model by combining Flume C++ (a C++ counterpart of FlumeJava) with a DHT communication layer. We have evaluated our AMPC algorithms for minimum spanning tree, maximal independent set and maximum matching and observed that we can achieve up to 7x speedups over implementations that did not use a DHT. At the same time, the AMPC implementations used 10x fewer rounds on average to complete, and also wrote less data to disk.

Our implementation of the AMPC model took advantage of hardware-accelerated remote direct memory access (RDMA), a technology that allows reading from the memory of a remote machine with a latency of a few microseconds, which is just an order of magnitude slower than reading from local memory. While some of the AMPC algorithms communicated more data than their MPC counterparts, they were overall faster, as they performed mostly fast reads using RDMA, instead of costly writes to disk.

Conclusion
With the AMPC model, we built a theoretical framework inspired by practically efficient implementations, and then developed new theoretical algorithms that delivered good empirical performance and maintained good fault-tolerance properties. We’ve been happy to see that the AMPC model has already been the subject of further study and are excited to learn what other problems can be solved more efficiently using the AMPC model or its practical implementations.

Acknowledgements
Co-authors on the two papers covered in this blog post include Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, and Warren Schudy. We also thank members of the Graph Mining team for their collaborations, and especially Mohammad Hossein Bateni for his input on this post. To learn more about our recent work on scalable graph algorithms, see videos from our recent Graph Mining and Learning workshop.

Categories
Offsites

torch time series, take three: Sequence-to-sequence prediction

Today, we continue our exploration of multi-step time-series forecasting with torch. This post is the third in a series.

  • Initially, we covered basics of recurrent neural networks (RNNs), and trained a model to predict the very next value in a sequence. We also found we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

  • Next, we built a model “natively” for multi-step prediction. A small multi-layer-perceptron (MLP) was used to project RNN output to several time points in the future.

Of the two approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: When the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between those. (Imagine a weather forecast for ten days that never got updated.)

Now, we’d like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: It’s exactly the kind of situation we see with machine translation or summarization.

Quite fittingly, the types of models employed to these ends are named sequence-to-sequence models (often abbreviated seq2seq). In a nutshell, they split up the task into two components: an encoding and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first try. But the decoder has more information at its disposal: At each iteration, its processing is based on the previous prediction as well as previous state. That previous state will be the encoder’s when a loop is started, and its own ever thereafter.

Before discussing the model in detail, we need to adapt our data input mechanism.

Data input

We continue working with vic_elec , provided by tsibbledata.

Again, the dataset definition in the current post looks a bit different from the way it did before; it’s the shape of the target that differs. This time, y equals x, shifted to the left by one.

The reason we do this lies in the way we are going to train the network. With seq2seq, people often use a technique called “teacher forcing” where, instead of feeding the decoder its own previous prediction, you pass it the value it should have predicted. To be clear, this is done during training only, and to a configurable degree.

n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    n <- length(self$x) - self$n_timesteps - 1
    
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)

Dataset as well as dataloader instantiations can then proceed as before.

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)

Model

Technically, the model consists of three modules: the aforementioned encoder and decoder, and the seq2seq module that orchestrates them.

Encoder

The encoder takes its input and runs it through an RNN. Of the two things returned by a recurrent neural network, outputs and state, so far we’ve only been using output. This time, we do the opposite: We throw away the outputs, and only return the state.

If the RNN in question is a GRU (and assuming that of the outputs, we take just the final time step, which is what we’ve been doing throughout), there really is no difference: The final state equals the final output. If it’s an LSTM, however, there is a second kind of state, the “cell state”. In that case, returning the state instead of the final output will carry more information.

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    x <- self$rnn(x)
    
    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
    
  }
  
)

Decoder

In the decoder, just like in the encoder, the main component is an RNN. In contrast to previously-shown architectures, though, it does not just return a prediction. It also reports back the RNN’s final state.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    self$linear <- nn_linear(hidden_size, 1)
    
  },
  
  forward = function(x, state) {
    
    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)
    
    # break up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (num_layers, batch_size, hidden_size) for a GRU, a list of two such tensors for an LSTM
    c(output, next_hidden) %<-% x
    
    output <- output$squeeze(2)
    output <- self$linear(output)
    
    list(output, next_hidden)
    
  }
  
)
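
Assuming the modules above, a quick check on dummy tensors shows the shapes the decoder consumes and produces (batch size and hidden size below are arbitrary):

dec <- decoder_module("gru", input_size = 1, hidden_size = 32)

x     <- torch_randn(4, 1, 1)    # (batch_size, 1, 1): a single value per series
state <- torch_randn(1, 4, 32)   # (1, batch_size, hidden_size)

out <- dec(x, state)
dim(out[[1]])  # 4 1      -- the prediction
dim(out[[2]])  # 1 4 32   -- the updated decoder state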

seq2seq module

seq2seq is where the action happens. The plan is to encode once, then call the decoder in a loop.

If you look back to decoder forward(), you see that it takes two arguments: x and state.

Depending on the context, x corresponds to one of three things: final input, preceding prediction, or prior ground truth.

  • The very first time the decoder is called on an input sequence, x maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we’d like to continue where the actual measurements stop.

  • In further calls, we want the decoder to continue from its most recent prediction. It is only logical, thus, to pass back the preceding forecast.

  • That said, in NLP a technique called “teacher forcing” is commonly used to speed up training. With teacher forcing, instead of the forecast we pass the actual ground truth, the thing the decoder should have predicted. We do that only in a configurable fraction of cases, and – naturally – only while training. The rationale behind this technique is that without this form of re-calibration, consecutive prediction errors can quickly erase any remaining signal.

state, too, is polyvalent. But here, there are just two possibilities: encoder state and decoder state.

  • The first time the decoder is called, it is “seeded” with the final state from the encoder. Note how this is the only time we make use of the encoding.

  • From then on, the decoder’s own previous state will be passed. Remember how it returns two values, forecast and state?

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)
    
    # encode current input sequence
    hidden <- self$encoder(x)
    
    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden)
    
    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out
    
    # store first prediction
    outputs[ , 1] <- pred$squeeze(2)
    
    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      
      # call decoder on either ground truth or previous prediction, plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state)
      
      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[ , t] <- pred$squeeze(2)
    }
    outputs
  }
  
)

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter 
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use 
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)

Training

The training procedure is mainly unchanged. We do, however, need to decide about teacher_forcing_ratio, the proportion of input sequences we want to perform re-calibration on. In valid_batch(), this should always be 0, while in train_batch(), it’s up to us (or rather, experimentation). Here, we set it to 0.3.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, training: loss: %3.5f n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, validation: loss: %3.5f n", epoch, mean(valid_loss)))
}
Epoch 1, training: loss: 0.37961 

Epoch 1, validation: loss: 1.10699 

Epoch 2, training: loss: 0.19355 

Epoch 2, validation: loss: 1.26462 

# ...
# ...

Epoch 49, training: loss: 0.03233 

Epoch 49, validation: loss: 0.62286 

Epoch 50, training: loss: 0.03091 

Epoch 50, validation: loss: 0.54457

It’s interesting to compare performances for different settings of teacher_forcing_ratio. With a setting of 0.5, training loss decreases a lot more slowly; the opposite is seen with a setting of 0. Validation loss, however, is not affected significantly.
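
If you want to run that comparison yourself, here is a rough sketch (not from the original post; the helper name and the reduced epoch count are made up for illustration). It retrains the network from scratch for each ratio, reusing train_batch() and valid_batch() from above, and returns the mean validation loss per setting. Note that it overwrites the global net and optimizer, so re-run the main training loop afterwards before moving on to evaluation.

# sketch: compare teacher-forcing ratios by retraining from scratch
compare_ratios <- function(ratios, epochs = 10) {
  sapply(ratios, function(ratio) {
    # overwrite the globals that train_batch() / valid_batch() refer to
    net <<- seq2seq_module("gru", input_size = 1, hidden_size = 32,
                           n_forecast = n_forecast)$to(device = device)
    optimizer <<- optim_adam(net$parameters, lr = 0.001)
    for (epoch in 1:epochs) {
      net$train()
      coro::loop(for (b in train_dl) train_batch(b, teacher_forcing_ratio = ratio))
    }
    net$eval()
    valid_loss <- c()
    coro::loop(for (b in valid_dl) valid_loss <- c(valid_loss, valid_batch(b)))
    mean(valid_loss)
  })
}

compare_ratios(c(0, 0.3, 0.5))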

Evaluation

The code to inspect test-set forecasts is unchanged.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  
  input <- b$x
  output <- net(input$to(device = device))
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)


preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
One-week-ahead predictions for January, 2014.

Comparing this to the forecast obtained from last time’s RNN-MLP combo, we don’t see much of a difference. Is this surprising? To me it is. If asked to speculate about the reason, I would probably say this: in all of the architectures we’ve used so far, the main carrier of information has been the final hidden state of the RNN (the one and only RNN in the two previous setups, the encoder RNN in this one). It will be interesting to see what happens in the last part of this series, when we augment the encoder-decoder architecture with attention.
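
To put a number on that comparison, one option (a sketch, not part of the original post) is to compute the mean test-set MSE the same way valid_batch() does, and contrast it with the value obtained from the previous architecture; this assumes test_dl yields y just like the other dataloaders.

# sketch: average MSE over the test set
net$eval()
test_loss <- c()
coro::loop(for (b in test_dl) {
  output <- net(b$x$to(device = device))
  loss <- nnf_mse_loss(output, b$y$to(device = device))
  test_loss <- c(test_loss, loss$item())
})
mean(test_loss)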

Thanks for reading!

Photo by Suzuha Kozuki on Unsplash