<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Josh Meyer&apos;s Website</title>
    <description>Hi! My name&apos;s Josh. I&apos;ve spent the last decade working on speech, language, and AI. This blog is some of what I&apos;m learning along the way. All opinions are my own.</description>
    <link>http://jrmeyer.github.io/</link>
    <atom:link href="http://jrmeyer.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 27 Mar 2026 00:07:38 +0000</pubDate>
    <lastBuildDate>Fri, 27 Mar 2026 00:07:38 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Replacing Apple Dictation with Moonshine Flow (FOSS + local)</title>
        <description>&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;TLDR;&lt;/h2&gt;

&lt;p&gt;I made a drop-in replacement for Apple’s voice dictation. It’s 100% open source, runs locally, and afaict it’s higher quality. It probably consumes more memory / power though.&lt;/p&gt;

&lt;p&gt;Source code: &lt;a href=&quot;https://github.com/JRMeyer/MoonshineFlow&quot;&gt;https://github.com/JRMeyer/MoonshineFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here it is in action:&lt;/p&gt;

&lt;center&gt;
  &lt;video controls=&quot;&quot; preload=&quot;metadata&quot; playsinline=&quot;&quot; style=&quot;max-width: 100%; height: auto;&quot;&gt;
    &lt;source src=&quot;/misc/moonshine-flow.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    Your browser does not support the video tag.
  &lt;/video&gt;
&lt;/center&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;I use Apple’s built-in voice dictation a lot. In particular, when I picked up Claude Code in the summer of 2025, I started talking into the terminal more than typing. But it’s not just Claude Code. I use dictation for WhatsApp messages, Google searches, emails, basically anywhere I need to input text. Apple Dictation is great because it just works in every text field on macOS, and it’s easy to set up. It’s “set it and forget it”.&lt;/p&gt;

&lt;p&gt;The downside is that the quality isn’t SOTA, especially when I’m talking about code: Apple Dictation often writes “get hub” instead of “GitHub”. But the convenience outweighs the annoyance, and I hadn’t seen a drop-in replacement.&lt;/p&gt;

&lt;p&gt;I tried &lt;a href=&quot;https://wisprflow.ai/data-controls#:~:text=Transcription%20always%20occurs%20on%20the%20cloud.%20This%20is%20the%20best%20way%20for%20us%20to%20provide%20accurate%2C%20low%20latency%20transcription.&quot;&gt;Wispr Flow&lt;/a&gt; when there was a lot of hype around it. The quality improvement wasn’t worth the overhead to me. Plus, I prefer to keep my audio on my machine.&lt;/p&gt;

&lt;p&gt;I’ve had my eye on &lt;a href=&quot;https://www.moonshine.ai/&quot;&gt;Moonshine&lt;/a&gt; for a while. Open-source models + inference engine that run on macOS (and lots of other places). &lt;a href=&quot;https://petewarden.com&quot;&gt;Pete Warden&lt;/a&gt; (Moonshine founder) is one of the OGs of on-device speech recognition. Moonshine released a macOS app called &lt;a href=&quot;https://note-taker.moonshine.ai/&quot;&gt;Moonshine Note Taker&lt;/a&gt;, so I tried it out and loved it. It’s easy to use, and the quality is great.&lt;/p&gt;

&lt;p&gt;But Note Taker is designed for transcribing into its own window. I wanted a drop-in replacement for Apple Dictation, i.e. a global hotkey that inserts text wherever I put my cursor.&lt;/p&gt;

&lt;p&gt;So I wondered: if Moonshine runs this smoothly on Apple Silicon, maybe with a little help from my friend Claude Code I can hack something together.&lt;/p&gt;

&lt;p&gt;That’s exactly what &lt;a href=&quot;https://github.com/JRMeyer/MoonshineFlow&quot;&gt;Moonshine Flow&lt;/a&gt; is :)&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;what-it-does&quot;&gt;What It Does&lt;/h2&gt;

&lt;p&gt;Moonshine Flow is a menu-bar app. Double-tap the right Option key to start dictation, speak, and tap once to stop. Text streams into whatever app has focus.&lt;/p&gt;
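
&lt;p&gt;The app itself is Swift, but the hotkey logic is simple enough to sketch. Here’s a rough Python illustration using pynput (the 0.4-second double-tap window and the print statements are placeholders I made up, not the app’s actual values):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import time
from pynput import keyboard   # illustration only; the real app uses Swift + macOS APIs

DOUBLE_TAP_WINDOW = 0.4       # seconds between taps; an assumed threshold
last_tap = 0.0
dictating = False

def on_press(key):
    global last_tap, dictating
    if key != keyboard.Key.alt_r:              # right Option key
        return
    now = time.monotonic()
    if dictating:
        dictating = False                      # one tap while active stops dictation
        print(&quot;stop dictation&quot;)
    elif now - last_tap &lt; DOUBLE_TAP_WINDOW:
        dictating = True                       # two quick taps start dictation
        print(&quot;start dictation&quot;)
    last_tap = now

with keyboard.Listener(on_press=on_press) as listener:
    listener.join()&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;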

&lt;p&gt;The streaming is a little different for terminals (phrase-by-phrase, not word-by-word) because they don’t support the accessibility API as well (I haven’t spent a ton of time on this… PRs welcome!).&lt;/p&gt;

&lt;p&gt;Everything runs locally. No audio leaves your machine.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;getting-it-running&quot;&gt;Getting It Running&lt;/h2&gt;

&lt;p&gt;The app uses Moonshine’s &lt;a href=&quot;https://github.com/moonshine-ai/moonshine&quot;&gt;open-source engine&lt;/a&gt; via their &lt;a href=&quot;https://github.com/moonshine-ai/moonshine-swift&quot;&gt;Swift package&lt;/a&gt;. The only manual step is downloading the model files (~290MB). Full setup is in the &lt;a href=&quot;https://github.com/JRMeyer/MoonshineFlow/blob/main/SETUP.md&quot;&gt;repo’s SETUP.md&lt;/a&gt;, but the gist is:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;git clone git@github.com:JRMeyer/MoonshineFlow.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;MoonshineFlow

&lt;span class=&quot;c&quot;&gt;# Download model files (~290MB)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;MODEL_DIR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;MoonshineFlow/models/medium-streaming-en
&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;f &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;adapter.ort cross_kv.ort decoder_kv.ort encoder.ort &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
         frontend.ort streaming_config.json tokenizer.bin&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;https://download.moonshine.ai/model/medium-streaming-en/quantized/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MODEL_DIR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;done

&lt;/span&gt;swift build &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; swift run&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You’ll need Xcode installed (not just Command Line Tools) on an Apple Silicon Mac running macOS 15+. On first run, grant Microphone, Accessibility, and Input Monitoring permissions.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;enjoy&quot;&gt;Enjoy!&lt;/h2&gt;

&lt;p&gt;PRs welcome :)&lt;/p&gt;

</description>
        <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/asr/2026/03/26/moonshine-flow.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/asr/2026/03/26/moonshine-flow.html</guid>
        
        
        <category>ASR</category>
        
      </item>
    
      <item>
        <title>Displaying Images in Claude Code</title>
        <description>&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/misc/figs/ghostty-image-mcp-1.png&quot; style=&quot;width: 32%; display: inline-block; vertical-align: top;&quot; /&gt;
&lt;img src=&quot;/misc/figs/ghostty-image-mcp-2.png&quot; style=&quot;width: 32%; display: inline-block; vertical-align: top;&quot; /&gt;
&lt;img src=&quot;/misc/figs/ghostty-image-mcp-3.png&quot; style=&quot;width: 32%; display: inline-block; vertical-align: top;&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;In my &lt;a href=&quot;/misc/2026/02/21/ghostty-watercolors.html&quot;&gt;last post&lt;/a&gt; I mentioned that Ghostty can display images inline in the terminal. I wanted to do this with &lt;a href=&quot;https://docs.anthropic.com/en/docs/claude-code&quot;&gt;Claude Code&lt;/a&gt;, but it won’t work out of the box.&lt;/p&gt;

&lt;p&gt;The problem is Claude Code doesn’t have a built-in way to send images to the terminal. So I built an &lt;a href=&quot;https://github.com/JRMeyer/ghostty-image-mcp&quot;&gt;MCP server&lt;/a&gt; that does it.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;why-mcp&quot;&gt;Why MCP?&lt;/h2&gt;

&lt;p&gt;An MCP (Model Context Protocol) server is the right tool here because Claude Code needs to write raw bytes directly to the terminal, NOT text. The key is in the &lt;a href=&quot;https://sw.kovidgoyal.net/kitty/graphics-protocol/&quot;&gt;Kitty graphics protocol&lt;/a&gt; escape sequences. None of Claude Code’s built-in tools can do this. The Bash tool captures stdout as text.&lt;/p&gt;

&lt;p&gt;An MCP server runs as a separate process, so it isn’t bound by Claude Code’s tool limitations. The server captures the controlling TTY at startup, then writes escape sequences directly to the terminal using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;os.write()&lt;/code&gt; on the raw file descriptor. Claude Code never even sees the image data.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How It Works&lt;/h2&gt;

&lt;p&gt;The whole server is a single Python file.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;at startup, grab the TTY path via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/tty&lt;/code&gt; before stdio takes over&lt;/li&gt;
  &lt;li&gt;convert the file to PNG (via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sips&lt;/code&gt; on macOS, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsvg-convert&lt;/code&gt; for SVGs, or CoreGraphics for PDFs)&lt;/li&gt;
  &lt;li&gt;Base64-encode the PNG file path&lt;/li&gt;
  &lt;li&gt;Write a single Kitty graphics escape sequence to the TTY: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\x1b_Ga=T,f=100,t=f,c={cols},q=2;{path}\x1b\\&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;The terminal reads the file and renders it inline&lt;/li&gt;
&lt;/ol&gt;
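
&lt;p&gt;Here’s a condensed sketch of steps 1, 3, and 4 (simplified from the real server; the function name and the default column count are just for illustration, and it assumes the file is already a PNG):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import base64, os

# Step 1: grab the controlling terminal before the MCP stdio transport
# takes over stdin/stdout.
tty_fd = os.open(&quot;/dev/tty&quot;, os.O_WRONLY)

def show_png(path, cols=80):
    # Steps 3-4: base64-encode the file path, then write a single Kitty
    # graphics escape sequence. The terminal reads the PNG itself and
    # renders it inline; Claude Code never sees the image data.
    payload = base64.standard_b64encode(path.encode()).decode()
    seq = f&quot;\x1b_Ga=T,f=100,t=f,c={cols},q=2;{payload}\x1b\\&quot;
    os.write(tty_fd, seq.encode())&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;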

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;/h2&gt;

&lt;p&gt;Clone the repo and add it to Claude Code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;git clone https://github.com/jrmeyer/ghostty-image-mcp.git
claude mcp add ghostty-image &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; uv run /path/to/ghostty-image-mcp/server.py&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You need &lt;a href=&quot;https://docs.astral.sh/uv/&quot;&gt;uv&lt;/a&gt;, Python 3.10+, and a Kitty graphics-compatible terminal (&lt;a href=&quot;https://ghostty.org&quot;&gt;Ghostty&lt;/a&gt;, &lt;a href=&quot;https://sw.kovidgoyal.net/kitty/&quot;&gt;Kitty&lt;/a&gt;, etc.).&lt;/p&gt;

&lt;p&gt;Then just ask Claude to show you things:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;show me ~/photos/cat.jpg
show me this PDF page 5: ~/papers/attention.pdf
make it 2x bigger
rotate it&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It handles PNG, JPEG, HEIC, SVG, PDF, and anything else &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sips&lt;/code&gt; can convert :)&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-repo&quot;&gt;The Repo&lt;/h2&gt;

&lt;p&gt;Everything is at &lt;a href=&quot;https://github.com/JRMeyer/ghostty-image-mcp&quot;&gt;github.com/JRMeyer/ghostty-image-mcp&lt;/a&gt;. It’s one file, about 160 lines. Let me know if you have comments or run into issues.&lt;/p&gt;

</description>
        <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/misc/2026/03/09/ghostty-image-mcp.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/misc/2026/03/09/ghostty-image-mcp.html</guid>
        
        
        <category>misc</category>
        
      </item>
    
      <item>
        <title>Watercolor Shaders for Ghostty</title>
        <description>&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/wet-on-wet.png&quot; style=&quot;width: 600px;&quot; /&gt;&lt;/center&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I spend most of my time in the terminal. When I tried out &lt;a href=&quot;https://ghostty.org&quot;&gt;Ghostty&lt;/a&gt; this week, I didn’t expect to switch from macOS Terminal – I didn’t have a lot of complaints before. But then I zoomed in and out on the text without the window resizing, and I decided to try it as a daily driver.&lt;/p&gt;

&lt;p&gt;Then I found you can display images right in the terminal. This was huge for me. I ssh into servers a lot and work on code there with Claude, which is great at generating diagrams and charts. Over ssh with macOS Terminal I can’t see them – I’d just scp them to my laptop and open them in a browser. With Ghostty I can generate an SVG on the server, convert it to PNG, and show it inline. This alone is a major win.&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/ghostty-inline-image.png&quot; style=&quot;width: 600px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;em&gt;Displaying an image inline in Ghostty&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;So I liked the functionality, but then I found all the fun, not-so-functional things you can do with it. Ghostty has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;custom-shader&lt;/code&gt; feature that lets you write GLSL fragment shaders that render behind your terminal content.&lt;/p&gt;

&lt;p&gt;I’ve been watercoloring lately, so I wondered if I could make watercolor washes as a terminal background. I ended up making a whole collection of them, each named after a real painting technique. The repo is at &lt;a href=&quot;https://github.com/JRMeyer/ghostty-watercolors&quot;&gt;github.com/JRMeyer/ghostty-watercolors&lt;/a&gt;. PRs are welcome :)&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-shaders&quot;&gt;The Shaders&lt;/h2&gt;

&lt;p&gt;There are nine shaders in the collection so far. Each one tries to simulate a different wash technique:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Flat Wash&lt;/strong&gt; – Uniform color with organic edges.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Graded Wash&lt;/strong&gt; – Fades from full color to transparent, top to bottom.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Variegated Wash&lt;/strong&gt; – Two colors blending into each other.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wet-on-Wet&lt;/strong&gt; – Soft, bleeding color regions like pigment dropped on wet paper.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Glazing&lt;/strong&gt; – Multiple transparent color layers stacked with visible overlap.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Granulating&lt;/strong&gt; – Pigment settles into paper texture, creating a speckled, grainy look.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Salt&lt;/strong&gt; – Fine speckled texture where salt crystals disrupted the wash.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cauliflower&lt;/strong&gt; – Backruns with fractal edges where wet paint crept into drying areas.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Splatter&lt;/strong&gt; – Random droplets scattered across a light wash, like flicking a loaded brush.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are a few of them in action:&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/flat-wash.png&quot; style=&quot;width: 600px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;em&gt;Flat Wash&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/variegated-wash.png&quot; style=&quot;width: 600px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;em&gt;Variegated Wash&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/salt.png&quot; style=&quot;width: 600px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;em&gt;Salt&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;getting-started&quot;&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;Add a shader to your Ghostty config (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.config/ghostty/config&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;custom-shader &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; /path/to/shader.glsl
background-opacity &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 0.85&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I also like adding some extra padding so the wash has room to breathe:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;window-padding-x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 64
window-padding-y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 64&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Each shader uses a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WASH_HUE&lt;/code&gt; placeholder for the color. Replace it with a value between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.0&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1.0&lt;/code&gt; to pick a hue:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/WASH_HUE/0.6/g&apos;&lt;/span&gt; flat-wash-bg.glsl &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; my-shader.glsl&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;randomizing-per-window&quot;&gt;Randomizing Per Window&lt;/h2&gt;

&lt;p&gt;There’s also a small script called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;randomize-shader.sh&lt;/code&gt; that picks a random shader &lt;em&gt;and&lt;/em&gt; a random color each time it runs. Source it in your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.zshrc&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; /path/to/ghostty-watercolors/randomize-shader.sh&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Then point your Ghostty config to the generated file:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;custom-shader &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; /path/to/ghostty-watercolors/active-shader.glsl&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now every new terminal window gets a different wash type and color. Pretty cool, right?&lt;/p&gt;

</description>
        <pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/misc/2026/02/21/ghostty-watercolors.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/misc/2026/02/21/ghostty-watercolors.html</guid>
        
        
        <category>misc</category>
        
      </item>
    
      <item>
        <title>An Overview of Multi-Task Learning in Speech Recognition</title>
        <description>&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The following blog post is a chapter in my &lt;a href=&quot;http://jrmeyer.github.io/misc/MEYER_dissertation_2019.pdf&quot;&gt;dissertation&lt;/a&gt;, which I finished in the summer of 2019. The field of Automatic Speech Recognition moves fast, but I think you will find the general trends and logic discussed here to hold true today.&lt;/p&gt;

&lt;p&gt;All citations are footnotes, because Markdown ¯\_(ツ)_/¯&lt;/p&gt;

&lt;p&gt;Feel free to leave comments below (especially if you have newer research on Multi-Task Learning for speech!)&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;roadmap&quot;&gt;Roadmap&lt;/h2&gt;

&lt;p&gt;In this overview of Multi-Task Learning in Automatic Speech Recognition (ASR), we’re going to cover a lot of ground quickly. First, we’re going to define Multi-Task Learning and walk through a very simple example from image recognition. Next, once we have an understanding of Multi-Task Learning and we have a definition of “task”, we will move into a survey of the speech recognition literature&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;The literature survey focuses on acoustic modeling in particular. Speech Recognition has a long history, but this blog post is limited in scope to the Hybrid (i.e. DNN-HMM) and End-to-End approaches. Both approaches involve training Deep Neural Networks, and we will focus on how Multi-Task Learning has been used to train them. We will divvy up the literature along monolingual and multilingual models, and then finally we will touch on Multi-Task Learning in other speech technologies such as Speech Synthesis and Speaker Verification.&lt;/p&gt;

&lt;p&gt;The term “Multi-Task Learning” encompasses more than a single model performing multiple tasks at inference. Multi-Task Learning can be useful even when there is just &lt;em&gt;one&lt;/em&gt; target task of interest. Especially with regards to small datasets, a Multi-Task model can out-perform a model which was trained on just one task.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;an-introduction-to-multi-task-learning&quot;&gt;An Introduction to Multi-Task Learning&lt;/h2&gt;

&lt;h3 id=&quot;definition&quot;&gt;Definition&lt;/h3&gt;

&lt;p&gt;Before we define &lt;em&gt;Multi-Task Learning&lt;/em&gt;, let’s first define what we mean by &lt;em&gt;task&lt;/em&gt;. Some researchers may define a task as a set of data and corresponding target labels (i.e. a task is merely \((X,Y)\)). Other definitions may focus on the statistical function that performs the mapping of data to targets (i.e. a task is the function \(f: X \rightarrow Y\)). In order to be precise, let’s define a task as the combination of data, targets, and mapping function.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;
A &lt;em&gt;task&lt;/em&gt; is the combination of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Data: \(X\), a sample of data from a certain domain&lt;/li&gt;
  &lt;li&gt;Targets: \(Y\), a sample of targets from a certain domain&lt;/li&gt;
  &lt;li&gt;Mapping Function: \(f: X \rightarrow Y\), a function which maps data to targets 
&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;em&gt;targets&lt;/em&gt; might be distinct label categories represented by one-hot vectors (e.g. classification labels), or they can be \(N\)-dimensional continuous vectors (e.g. a target for regression)&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Given this definition of &lt;em&gt;task&lt;/em&gt;, we define Multi-Task Learning as a training procedure which updates model parameters such that the parameters optimize performance on multiple tasks in parallel&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. At its core, Multi-Task Learning is an approach to parameter estimation for statistical models&lt;sup id=&quot;fnref:caruana1998multitask&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:caruana1998multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:caruana1996algorithms&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:caruana1996algorithms&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. Even though we use multiple tasks during training, we will produce only one model. A subset of the model’s parameters will be task-specific, and another subset will be shared among all tasks. Shared model parameters are updated according to the error signals of all tasks, whereas task-specific parameters are updated according to the error signal of only one task.&lt;/p&gt;

&lt;p&gt;It is important to note that a Multi-Task model will have both &lt;em&gt;task-dependent&lt;/em&gt; and &lt;em&gt;task-independent&lt;/em&gt; parameters. The main intuition as to why the Multi-Task approach works is the following: if tasks are related, they will rely on a common underlying representation of the data. Learning related tasks together will bias the shared parameters to encode robust, task-independent representations of the data.&lt;/p&gt;

&lt;p&gt;Given this definition of &lt;em&gt;task&lt;/em&gt; and this definition of &lt;em&gt;Multi-Task Learning&lt;/em&gt;, we can start to think about the different ways in which a Multi-Task model can be trained. Probably the most common Multi-Task use-case is the classification of a single dataset \((X)\) as multiple sets of target labels \((Y_{1}, Y_{2} \dots Y_{N})\). This model will perform mappings from \((X)\) into each of the label spaces separately. Another approach is the classification of multiple datasets sampled from various domains \((X_{1}, X_{2} \dots X_{N})\) as their own, dataset-specific targets \((Y_{1}, Y_{2} \dots Y_{N})\). Less commonly, it is possible to classify multiple datasets using one super-set of labels. These different approaches are represented with regards to vanilla feed-forward neural networks in Figure (1).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/color-si-mo.png&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 1a&lt;/strong&gt;:  Single Input (&amp;#x2B1B;), Multiple Output (&amp;#x1F7E5;)&lt;/center&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/color-mi-so.png&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 1b&lt;/strong&gt;: Multiple Input (&amp;#x1F7E6;), Single Output (&amp;#x2B1B;)&lt;/center&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/color-mi-mo.png&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 1c&lt;/strong&gt;: Multiple Input (&amp;#x1F7E6;), Multiple Output (&amp;#x1F7E5;)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Possible Neural Multi-Task Architectures. Black layers are task-independent layers, blue layers are task-dependent input layers, and red layers are task-dependent output layers. These figures are modified versions of a figure from Heigold et al. (2013).&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;With regards to neural approaches (c.f. Figure (1)), Multi-Task models are usually composed of three component parts: (1) &lt;strong&gt;shared layers&lt;/strong&gt; (⬛) which serve as a task-independent feature extractor; (2) &lt;strong&gt;task-specific output layers&lt;/strong&gt; (🟥) which serve as task-dependent classifiers or regressors; (3) &lt;strong&gt;task-specific input layers&lt;/strong&gt; (🟦) which serve as feature transformations from domain-specific to domain-general representations. Neural Multi-Task models will always have some hidden layers shared among tasks. This view of a Multi-Task neural network highlights the intuition behind the shared hidden layers: to encode robust representations of the input data.&lt;/p&gt;
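
&lt;p&gt;To make the layout of Figure (1c) concrete, the following is a small, illustrative PyTorch sketch (it is not taken from any of the cited papers, and the layer widths and dimensions are arbitrary): task-specific input layers project each domain into a shared space, a stack of shared hidden layers acts as the task-independent feature extractor, and task-specific output layers perform the final classification.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    # Figure (1c)-style sketch: task-dependent input and output layers
    # wrapped around a shared, task-independent trunk. Sizes are arbitrary.
    def __init__(self, in_dims=(40, 80), hidden=256, out_dims=(500, 300)):
        super().__init__()
        # task-dependent input layers (one per input domain)
        self.inputs = nn.ModuleList(nn.Linear(d, hidden) for d in in_dims)
        # shared, task-independent hidden layers
        self.shared = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # task-dependent output layers (one per set of target labels)
        self.outputs = nn.ModuleList(nn.Linear(hidden, d) for d in out_dims)

    def forward(self, x, task):
        h = torch.relu(self.inputs[task](x))
        return self.outputs[task](self.shared(h))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;During training, the shared trunk receives gradients from every task’s loss, while each input and output layer receives gradients only from its own task.&lt;/p&gt;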

&lt;p&gt;With regards to domains in which we have very limited data (i.e. low-resource environments), Multi-Task parameter estimation promises gains in performance which do not require us to collect more in-domain data, as long as we can create new tasks. In the common scenario where an engineer has access to only a small dataset, the best way she could improve performance would be by collecting more data. However, data collection takes time and money. This is the promise of Multi-Task Learning in low-resource domains: if the engineer can create new tasks, then she does not need to collect more data.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;an-example-of-multi-task-learning&quot;&gt;An Example of Multi-Task Learning&lt;/h3&gt;

&lt;p&gt;This section gives an example of a Multi-Task image recognition framework, where we start with a single task, create a second task, and train a model to perform both tasks. Starting with a single task, suppose we have access to the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Data: \(X_1\), a collection of photographs of dogs from one camera&lt;/li&gt;
  &lt;li&gt;Targets: \(Y_1\), a finite set of &lt;strong&gt;dog_breed&lt;/strong&gt; labels (e.g. terrier, collie, rottweiler)&lt;/li&gt;
  &lt;li&gt;Mapping Function: \(f_1: X_1 \rightarrow Y_1\), a vanilla feedforward neural network which returns a set of probabilities over &lt;strong&gt;dog_breed&lt;/strong&gt; labels for a given photograph&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In order to create a new task, we either need to collect some data (\(X_2\)) from a new domain, create new targets (\(Y_2\)), or define a new mapping function (\(f_2: X_1 \rightarrow Y_1\)). Furthermore, we would like to create a &lt;em&gt;related task&lt;/em&gt;, with the hopes of improving performance on the original task. There are several ways we can go about making a new task. We could use the same set of labels (&lt;strong&gt;dog_breed&lt;/strong&gt;), but collect a new set of pictures from a different camera. We could try classifying each photograph according to the size of the dog, which would mean we created new labels for our existing data. In addition to our vanilla feed-forward network, we could use a convolutional neural network as a mapping function and share some of the hidden layers between the two networks.&lt;/p&gt;

&lt;p&gt;Assuming we don’t want to collect more data and we don’t want to add a new mapping function, the easiest way to create a new task is to create a new set of target labels. Since we only had a single set of labels available (i.e. &lt;strong&gt;dog_breed&lt;/strong&gt; (⬛)), we can manually add a new label to each photo (i.e. &lt;strong&gt;dog_size&lt;/strong&gt; (🟥)) by referencing an encyclopedia of dogs&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;We start with a single dataset of photos of dogs (\(X_1\) (⬛)) and a single set of classification labels (\(Y_1\) (⬛)) for the dog’s breed, and now we’ve added a new set of labels (\(Y_2\) (🟥)) for a classification task of the dog’s size. A few training examples from our training set (\(X_1\) (⬛), \(Y_1\) (⬛), \(Y_2\) (🟥)) may look like what we find in Figure (2).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/rotweiler.jpg&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;Rottweiler (&amp;#x2B1B;), Large (&amp;#x1F7E5;)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/collie.jpg&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;Collie (&amp;#x2B1B;), Large (&amp;#x1F7E5;)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;img src=&quot;/misc/figs/terrier.jpg&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;Terrier (&amp;#x2B1B;), Small (&amp;#x1F7E5;)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;center&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Three pictures of dogs from our dataset (&lt;strong&gt;X_1&lt;/strong&gt; (&amp;#x2B1B;)), where each picture has been labeled with separate targets: &lt;strong&gt;dog_breed&lt;/strong&gt; (&amp;#x2B1B;), &lt;strong&gt;dog_size&lt;/strong&gt; (&amp;#x1F7E5;)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Given this data and our two sets of labels, we can train a Multi-Task neural network to perform classification of both label sets with the vanilla feed-forward architecture shown in Figure (3). This model now has two task-specific output layers and two task-specific penultimate layers. The input layer and following three hidden layers are shared between both tasks. The shared parameters will be updated via the combined error signal of both tasks.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/dog-model.png&quot; align=&quot;center&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Multi-Task DNN for classifying pictures of dogs according to both &lt;strong&gt;dog_breed&lt;/strong&gt; (&amp;#x2B1B;) and &lt;strong&gt;dog_size&lt;/strong&gt; (&amp;#x1F7E5;). Any additional task by definition brings along with it additional parameters, because a subset of model parameters must be task-specific. Task-specific parameters for the new task of &lt;strong&gt;dog_size&lt;/strong&gt; (&amp;#x1F7E5;) classification are shown in red.&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
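
&lt;p&gt;As an illustrative sketch of the training procedure (the layer widths, label counts, and equal loss weighting below are arbitrary choices, not prescriptions), a single training step for the model in Figure (3) might look like this in PyTorch:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import torch
import torch.nn as nn

class DogNet(nn.Module):
    def __init__(self, in_dim=1024, hidden=512, n_breeds=100, n_sizes=3):
        super().__init__()
        # shared input layer and hidden layers (task-independent)
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # task-specific penultimate and output layers
        self.breed_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_breeds))
        self.size_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_sizes))

    def forward(self, x):
        h = self.shared(x)
        return self.breed_head(h), self.size_head(h)

model = DogNet()
x = torch.randn(8, 1024)                    # a batch of image features
breed_y = torch.randint(0, 100, (8,))       # dog_breed labels
size_y = torch.randint(0, 3, (8,))          # dog_size labels
breed_logits, size_logits = model(x)
loss = (nn.functional.cross_entropy(breed_logits, breed_y)
        + nn.functional.cross_entropy(size_logits, size_y))
loss.backward()   # shared layers receive the combined error signal of both tasks&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;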

&lt;p&gt;This example came from image recognition, but now we will move on to our overview of Multi-Task Learning in Automatic Speech Recognition. As we will see in what follows, researchers have trained Multi-Task Acoustic Models where the auxiliary tasks involve a new data domain, a new label set, or even a new mapping function.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;multi-task-learning-in-asr&quot;&gt;Multi-Task Learning in ASR&lt;/h2&gt;

&lt;p&gt;The Multi-Task Learning discussed here deals with either acoustic modeling in Hybrid (i.e. DNN-HMM) ASR, or it deals with End-to-End ASR&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. The Acoustic Model accepts as input a window of audio features (\(X\)) and returns a posterior probability distribution over phonetic targets (\(Y\)). The phonetic targets can be fine-grained context-dependent units (e.g. triphones from a Hybrid model), or these targets may be simply characters (e.g. as in End-to-End approaches). The following survey will focus on how Multi-Task Learning has been used to train these acoustic models, with a focus on the nature of the tasks themselves.&lt;/p&gt;

&lt;p&gt;Past work in Multi-Task acoustic modeling for speech recognition can be split into two broad categories, depending on whether data was used from multiple languages or just one language. In this survey, we will refer to these two branches of research as &lt;em&gt;monolingual&lt;/em&gt; vs. &lt;em&gt;multilingual&lt;/em&gt; approaches. Within each of those two branches, we find sub-branches of research, depending on how the auxiliary tasks are crafted. These major trends are shown in Figure (4), and will be discussed more in-depth below.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/overview-MTL.png&quot; align=&quot;center&quot; style=&quot;width: 700px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Major Trends in the Research on Multi-Task Learning in Automatic Speech Recognition. Here, &quot;Recording Characteristics&quot; refers to general characteristics of the audio file (i.e. the &quot;recording&quot;), not the quality of the &quot;recording&quot; setup or &quot;recording&quot; equipment. &lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Within &lt;em&gt;monolingual&lt;/em&gt; Multi-Task acoustic modeling we can identify two trends in the literature. We find that researchers will either (1) predict some additional linguistic representation of the input speech, or (2) explicitly model utterance-level characteristics of the utterance. When using additional linguistic tasks for a single language, each task is a phonetically relevant classification: predicting triphones vs. predicting monophones vs. predicting graphemes&lt;sup id=&quot;fnref:bell2015&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:bell2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:seltzer2013&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:seltzer2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:huang2015&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2014&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2014&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:toshniwal2017multitask&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:toshniwal2017multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;. When explicitly modeling utterance-specific characteristics, researchers either use adversarial learning to force the model to “forget” channel, noise, and speaker characteristics, or the extra task is a standard regression in order to pay extra attention to these features&lt;sup id=&quot;fnref:shinohara2016adversarial&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:shinohara2016adversarial&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:serdyuk2016invariant&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:serdyuk2016invariant&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:tripathi2018adversarial&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tripathi2018adversarial&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;16&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:saon2017english&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:saon2017english&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:meng2018speaker&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:meng2018speaker&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;18&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:sun2018domain&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sun2018domain&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:parveen2003multitask&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:parveen2003multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;20&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:giri2015improving&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:giri2015improving&quot; class=&quot;footnote&quot; 
rel=&quot;footnote&quot;&gt;21&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015speech&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015speech&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;22&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:zhang2017attention&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:zhang2017attention&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;23&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Within &lt;em&gt;multilingual&lt;/em&gt; Multi-Task acoustic modeling we can also identify two main veins of research: (1) using data from some source language(s) or (2) using a pre-trained model from some source language(s). When using data from source languages, most commonly we find researchers training a single neural network with multiple output layers, where each output layer represents phonetic targets from a different language&lt;sup id=&quot;fnref:huang2013&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;24&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:heigold2013&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:heigold2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;25&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:tuske2014multilingual&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tuske2014multilingual&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;26&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:mohan2015multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:mohan2015multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;27&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:grezl2016&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:grezl2016&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;28&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:matassoni2018non&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:matassoni2018non&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;29&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:yang2018joint&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:yang2018joint&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;30&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:rao2017multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:rao2017multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;31&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:jain2018improved&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:jain2018improved&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;32&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:sun2018domain:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sun2018domain&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt;. As such, these Acoustic Models look like the prototype shown in Figure (1a). 
When using a pre-trained model from some source language(s) we find researchers using the source model as either a teacher or as a feature extractor for the target language&lt;sup id=&quot;fnref:dupont2005feature&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:dupont2005feature&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;33&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:cui2015multilingual&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:cui2015multilingual&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;34&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:grezl2014adaptation&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:grezl2014adaptation&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;35&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:knill2013investigation&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:knill2013investigation&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;36&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:vu2014multilingual&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:vu2014multilingual&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;37&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:xu2015comparative&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:xu2015comparative&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;38&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:he2018a&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:he2018a&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;39&lt;/a&gt;&lt;/sup&gt;. The source model extracts embeddings of the target speech, and then the embedding is either used as the target for an auxiliary task or the embedding is concatenated to the standard input as a kind of feature enhancement.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;monolingual-multi-task-asr&quot;&gt;Monolingual Multi-Task ASR&lt;/h3&gt;

&lt;p&gt;With regards to monolingual Multi-Task Learning in ASR, we find two major tracks of research. The first approach is to find tasks (from the same language) which are linguistically relevant to the main task&lt;sup id=&quot;fnref:stadermann2005multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:stadermann2005multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;40&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:seltzer2013:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:seltzer2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:huang2015:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:bell2015:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:bell2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:arora2017phonological&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:arora2017phonological&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;41&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2014:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2014&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015diss&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015diss&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;42&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:bell2015complementary&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:bell2015complementary&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;43&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:swietojanski2015structured&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:swietojanski2015structured&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;44&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:badino2016phonetic&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:badino2016phonetic&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;45&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:pironkov2016multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:pironkov2016multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;46&lt;/a&gt;&lt;/sup&gt;. These studies define abstract phonetic categories (e.g. fricatives, liquids, voiced consonants), and use those category labels as auxiliary tasks for frame-level classification.&lt;/p&gt;

&lt;p&gt;The second major track of research in monolingual Multi-Task acoustic modeling involves explicit modeling of speaker, channel, or noise characteristics&lt;sup id=&quot;fnref:shinohara2016adversarial:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:shinohara2016adversarial&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:serdyuk2016invariant:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:serdyuk2016invariant&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:tripathi2018adversarial:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tripathi2018adversarial&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;16&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:saon2017english:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:saon2017english&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:meng2018speaker:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:meng2018speaker&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;18&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:sun2018domain:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sun2018domain&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:parveen2003multitask:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:parveen2003multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;20&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:giri2015improving:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:giri2015improving&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;21&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015speech:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015speech&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;22&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:zhang2017attention:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:zhang2017attention&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;23&lt;/a&gt;&lt;/sup&gt;. These studies train the Acoustic Model to identify these characteristics via an additional classification task, or they encourage the model to ignore this information via adversarial learning, or they force the model to map data from the input domain to another domain (e.g. from noisy audio \(\rightarrow\) clean audio).&lt;/p&gt;

&lt;p&gt;All the studies in this section have in common the following: the model in question learns an additional classification of the audio at the frame-level. That is, every chunk of audio sent to the model will be mapped onto a standard ASR category such as a triphone or a character &lt;em&gt;in addition to&lt;/em&gt; an auxiliary mapping which has some linguistic relevance. This linguistic mapping will typically be a broad phonetic class (think vowel vs. consonant) of which the typical target (think triphone) is a member.&lt;/p&gt;

&lt;p&gt;Good examples of defining additional auxiliary tasks via broad, abstract phonetic categories for English can be found in Seltzer (2013)&lt;sup id=&quot;fnref:seltzer2013:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:seltzer2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; and later Huang (2015)&lt;sup id=&quot;fnref:huang2015:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;. With regards to low-resource languages, some researchers have created extra tasks using graphemes or a universal phoneset as more abstract classes&lt;sup id=&quot;fnref:chen2014:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2014&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015diss:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015diss&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;42&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Taking a less linguistic approach, but one based on the exact same principle of using more abstract classes as auxiliary targets, Bell (2015)&lt;sup id=&quot;fnref:bell2015:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:bell2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; used monophone alignments as auxiliary targets for DNN-HMM acoustic modeling (in addition to the standard triphone alignments). The authors observed that standard training on context-dependent triphones could easily lead to over-fitting on the training data. When monophones were added as an extra task, they observed \(3-10\%\) relative improvements over baseline systems. The intuition behind this approach is that two triphones belonging to the same phoneme will be treated as completely unrelated classes in standard training by backpropagation. As such, valuable, exploitable information is lost. Follow-up studies&lt;sup id=&quot;fnref:bell2015complementary:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:bell2015complementary&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;43&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:swietojanski2015structured:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:swietojanski2015structured&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;44&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:badino2016phonetic:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:badino2016phonetic&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;45&lt;/a&gt;&lt;/sup&gt; made the linguistic similarities among DNN output targets explicit via linguistic phonetic concepts such as place, manner, and voicing as well as phonetic context embeddings.&lt;/p&gt;

&lt;p&gt;However, the benefits of using linguistic targets vary from study to study, and in their survey paper, Pironkov (2016)&lt;sup id=&quot;fnref:pironkov2016multi:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:pironkov2016multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;46&lt;/a&gt;&lt;/sup&gt; concluded that “Using even broader phonetic classes (such as plosive, fricative, nasal, \(\ldots\)) is not efficient for MTL speech recognition”. In particular, they were referring to the null findings from Stadermann (2005)&lt;sup id=&quot;fnref:stadermann2005multi:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:stadermann2005multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;40&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;In a similar vein, Multi-Task Learning has been used in an End-to-End framework, in an effort to encourage explicit learning of hierarchical structure of words and phonemes. Oftentimes these hierarchical phonemic levels (e.g. phonemes vs. words) are trained at different levels of the model itself&lt;sup id=&quot;fnref:fernandez2007sequence&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fernandez2007sequence&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;47&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:sanabria2018hierarchical&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sanabria2018hierarchical&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;48&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:krishna2018hierarchical&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:krishna2018hierarchical&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;49&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:toshniwal2017multitask:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:toshniwal2017multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:moriya2018multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:moriya2018multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;50&lt;/a&gt;&lt;/sup&gt;. Figure (5) displays the approach taken in Sanabria (2018)&lt;sup id=&quot;fnref:sanabria2018hierarchical:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sanabria2018hierarchical&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;48&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/sanabria-2018.png&quot; align=&quot;center&quot; style=&quot;width: 400px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 5&lt;/strong&gt;: Multi-Task Hierarchical Architecture from Sanabria and Metze (2018) &lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;All of these studies have in common the following: they encourage the model to learn abstract linguistic knowledge which is not explicitly available in the standard targets. Whatever the standard target may be (e.g. triphones, graphemes, etc.) the researchers in this section created abstract groupings of those labels, and used those new groupings as an additional task. These new groupings (e.g. monophones, sub-word units, etc) encourage the model to learn the set of underlying features (e.g. voicing, place of articulation, etc.) which distinguish the main targets.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;regression-on-better-features-as-a-new-task&quot;&gt;Regression on Better Features as a New Task&lt;/h3&gt;

&lt;p&gt;Class labels are the most common output targets for an auxiliary task, but the authors in &lt;sup id=&quot;fnref:parveen2003multitask:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:parveen2003multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;20&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:giri2015improving:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:giri2015improving&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;21&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015speech:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015speech&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;22&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:zhang2017attention:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:zhang2017attention&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;23&lt;/a&gt;&lt;/sup&gt; took an approach where they predicted de-noised versions of the input audio from noisy observations (c.f. Figure (6)). The effect of this auxiliary regression was that the Acoustic Model learned to denoise and classify each input audio frame simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/giri-2015.png&quot; align=&quot;center&quot; style=&quot;width: 350px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 6&lt;/strong&gt;: Regression and Classification Neural Network Architecture from Giri et al. (2015)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In a similar vein, Das (2017)&lt;sup id=&quot;fnref:das2017deep&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:das2017deep&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;51&lt;/a&gt;&lt;/sup&gt; trained an Acoustic Model to classify standard senone targets as well as regress an input audio frame to bottleneck features of that same frame. Bottleneck features are a compressed representation of the data which have been trained on some other dataset or task — as such bottleneck features should contain linguistic information. In a very early study, the authors in Lu (2004)&lt;sup id=&quot;fnref:lu2004multitask&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:lu2004multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;52&lt;/a&gt;&lt;/sup&gt; predicted enhanced audio frame features as an auxiliary task (along with the speaker’s gender).&lt;/p&gt;
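
&lt;p&gt;The common pattern across these regression-based studies can be sketched as a shared encoder with one classification head and one regression head (a generic PyTorch-style example, not the exact system from any of the papers above; the layer sizes and loss weight are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class DenoisingMTLModel(nn.Module):
    # Shared encoder with a senone classification head (main task) and a
    # regression head that reconstructs a clean or enhanced version of the frame.
    def __init__(self, n_feats, n_hidden, n_senones):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.senone_head = nn.Linear(n_hidden, n_senones)
        self.clean_frame_head = nn.Linear(n_hidden, n_feats)

    def forward(self, noisy_frame):
        h = self.encoder(noisy_frame)
        return self.senone_head(h), self.clean_frame_head(h)

def mtl_loss(senone_logits, clean_pred, senone_targets, clean_frames, aux_weight=0.5):
    # Cross entropy for the classification task, mean squared error for regression.
    ce = nn.functional.cross_entropy(senone_logits, senone_targets)
    mse = nn.functional.mse_loss(clean_pred, clean_frames)
    return ce + aux_weight * mse
&lt;/code&gt;&lt;/pre&gt;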

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;extra-mapping-function-as-new-task&quot;&gt;Extra Mapping Function as New Task&lt;/h3&gt;

&lt;p&gt;In End-to-End ASR, Kim (2017)&lt;sup id=&quot;fnref:kim2017joint&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:kim2017joint&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;53&lt;/a&gt;&lt;/sup&gt; created a Multi-Task model by adding a mapping function (CTC) to an attention-based encoder-decoder model. This is an interesting approach because the two mapping functions (CTC vs. attention) carry with them pros and cons, and the authors demonstrate that the alignment power of the CTC approach can be leveraged to help the attention-based model find good alignments faster. Along similar lines, Lu (2017)&lt;sup id=&quot;fnref:lu2017multitask&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:lu2017multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;54&lt;/a&gt;&lt;/sup&gt; trained an Acoustic Model to make use of both CTC and Segmental Conditional Random Fields. These works did not create new labels or find new data, but rather, they combined different alignment and classification techniques into one model.&lt;/p&gt;
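
&lt;p&gt;The joint CTC/attention objective is usually written as an interpolation of the two losses over a shared encoder. A minimal sketch (assuming PyTorch-style tensors; the interpolation weight is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

def joint_ctc_attention_loss(ctc_logprobs, att_logits, targets,
                             input_lengths, target_lengths, lam=0.3):
    # ctc_logprobs: (time, batch, vocab) log-probabilities from the CTC head.
    # att_logits:   (batch, target_len, vocab) decoder outputs under teacher forcing.
    # targets:      (batch, target_len) padded label ids (blank index excluded).
    # The two objectives are interpolated; lam weights the CTC term.
    ctc = nn.functional.ctc_loss(
        ctc_logprobs, targets, input_lengths, target_lengths, blank=0)
    att = nn.functional.cross_entropy(att_logits.transpose(1, 2), targets)
    return lam * ctc + (1.0 - lam) * att
&lt;/code&gt;&lt;/pre&gt;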

&lt;p&gt;A famous example of monolingual MTL using multiple mapping functions is the most common Kaldi implementation of the so-called “chain” model&lt;sup id=&quot;fnref:povey2016purely&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:povey2016purely&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;55&lt;/a&gt;&lt;/sup&gt;. This implementation uses two output layers on a standard feed-forward model: one output layer calculates a standard Cross Entropy loss, and the other calculates a version of the Maximum Mutual Information criterion.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;new-domain-as-new-task&quot;&gt;New Domain as New Task&lt;/h3&gt;

&lt;p&gt;If we consider the different characteristics of each recording as domain memberships, then any extra information we have access to (e.g. age, gender, location, noise environment) can be framed as domain information, and this information can be explicitly modeled in a Multi-Task model. Using a Multi-Task adversarial framework, these studies&lt;sup id=&quot;fnref:shinohara2016adversarial:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:shinohara2016adversarial&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:serdyuk2016invariant:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:serdyuk2016invariant&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:tripathi2018adversarial:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tripathi2018adversarial&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;16&lt;/a&gt;&lt;/sup&gt; taught an Acoustic Model to forget the differences between noise conditions; these studies&lt;sup id=&quot;fnref:saon2017english:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:saon2017english&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:meng2018speaker:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:meng2018speaker&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;18&lt;/a&gt;&lt;/sup&gt; taught their model to forget speakers; and Sun (2018)&lt;sup id=&quot;fnref:sun2018domain:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sun2018domain&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt; taught the model to forget accents.&lt;/p&gt;
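
&lt;p&gt;A common way to implement this kind of adversarial auxiliary task is a gradient reversal layer: the domain classifier is trained normally, but the gradient flowing back into the shared encoder is flipped, so the encoder is pushed toward domain-invariant representations. A minimal sketch (the cited studies differ in their exact formulations; names and sizes here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; flips the sign of the gradient on the way
    # back, so the shared encoder is trained to make the domain task fail.
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.scale, None

class AdversarialMTLModel(nn.Module):
    def __init__(self, n_feats, n_hidden, n_senones, n_domains):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_feats, n_hidden), nn.ReLU())
        self.senone_head = nn.Linear(n_hidden, n_senones)
        self.domain_head = nn.Linear(n_hidden, n_domains)

    def forward(self, x, adv_scale=1.0):
        h = self.encoder(x)
        senone_logits = self.senone_head(h)
        # The domain classifier sees reversed gradients w.r.t. the encoder.
        domain_logits = self.domain_head(GradReverse.apply(h, adv_scale))
        return senone_logits, domain_logits
&lt;/code&gt;&lt;/pre&gt;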

&lt;p&gt;In low-resource domains, it is often tempting to add data from a large, out-of-domain dataset into the training set. However, if the domains are different enough, a mixed training set may hurt performance more than it helps. Multi-Task Learning lends itself well to these multi-domain scenarios, allowing us to regulate how much influence the out-of-domain data has over parameter estimation during training. Usually we will want to down-weight the gradient from a source-domain task if the source dataset is large or if the task is only somewhat related.&lt;/p&gt;
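
&lt;p&gt;Concretely, this is often expressed as a weighted sum of per-task losses, e.g. \(\mathcal{L} = \mathcal{L}_{target} + \lambda \, \mathcal{L}_{source}\), where a small weight \(\lambda \ll 1\) limits how much the large or less-related source-domain task influences the shared parameters (a generic formulation, not tied to any one of the studies above).&lt;/p&gt;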

&lt;p&gt;The researchers in Qin (2018)&lt;sup id=&quot;fnref:qin2018automatic&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:qin2018automatic&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;56&lt;/a&gt;&lt;/sup&gt; investigated the low-resource domain of Cantonese aphasic speech. Using a small corpus of aphasic Cantonese speech in addition to two corpora of read Cantonese speech, the researchers simply trained a Multi-Task model with each corpus as its own task (i.e. data from each corpus was classified in its own output layer). Similarly, in an effort to better model child speech in a low-resource setting, the authors in Tong (2017)&lt;sup id=&quot;fnref:tong2017multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tong2017multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;57&lt;/a&gt;&lt;/sup&gt; created separate tasks for classification of child vs. adult speech, in addition to standard phoneme classification.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;discussion-of-monolingual-multi-task-learning&quot;&gt;Discussion of Monolingual Multi-Task Learning&lt;/h3&gt;

&lt;p&gt;In this section we’ve covered various examples of how researchers have incorporated Multi-Task Learning into speech recognition using data from a single language. Two major threads of work can be identified: (1) the use of abstract linguistic features as additional tasks, and (2) the use of speaker and other recording information as an extra task.&lt;/p&gt;

&lt;p&gt;With regards to the first track of work, researchers have created abstract linguistic target labels by defining linguistic categories by hand, by referring to the traditional phonetic decision tree, or by automatically finding relevant sub-word parts. Performance improvements with this approach have been found to be larger when working with small datasets&lt;sup id=&quot;fnref:bell2015complementary:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:bell2015complementary&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;43&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015diss:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015diss&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;42&lt;/a&gt;&lt;/sup&gt;. The intuition behind this line of work is the following: Forcing a model to classify input speech into broad linguistic classes should encourage the model to learn a set of underlying phonetic features which are useful for the main classification task.&lt;/p&gt;

&lt;p&gt;A discriminative Acoustic Model trained on standard output targets (e.g. triphones, characters) learns that each target is maximally different from every other target. The label-space at the output layer is N-dimensional, and every class (i.e. phonetic category) occupies a corner of that N-dimensional space. This means that classes are learned to be &lt;em&gt;maximally&lt;/em&gt; distinctive. In reality, we know that some of these targets are more similar than others, but the model does not know that. Taking the example of context-dependent triphones, the Acoustic Model does not have access to the knowledge that an &lt;strong&gt;[a]&lt;/strong&gt; surrounded by &lt;strong&gt;[t]&lt;/strong&gt;’s is the same vowel as an &lt;strong&gt;[a]&lt;/strong&gt; surrounded by &lt;strong&gt;[d]&lt;/strong&gt;’s. In fact, these two &lt;strong&gt;[a]&lt;/strong&gt;’s are treated as if they belong to completely different classes. It is obvious to humans that two flavors of &lt;strong&gt;[a]&lt;/strong&gt; are more similar to each other than an &lt;strong&gt;[a]&lt;/strong&gt; is similar to an &lt;strong&gt;[f]&lt;/strong&gt;. However, the output layer of the neural net does not encode these nuances. Discriminative training on triphone targets will lose the information that some triphones are more similar than others. One way to get that information back is to explicitly teach the model that two &lt;strong&gt;[a]&lt;/strong&gt; triphones belong to the same abstract class. This is the general intuition behind this first track of monolingual Multi-Task work in speech recognition.&lt;/p&gt;

&lt;p&gt;The second track of monolingual Multi-Task acoustic modeling involves explicit modeling of speaker, noise, and other recording characteristics via an auxiliary task. While all of these variables are extra-linguistic, studies have shown that either paying extra attention to them (via an auxiliary classification task) or completely ignoring them (via adversarial learning) can improve overall model performance in terms of Word Error Rate. This is a somewhat puzzling finding. Learning speaker information&lt;sup id=&quot;fnref:lu2004multitask:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:lu2004multitask&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;52&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:chen2015multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;58&lt;/a&gt;&lt;/sup&gt; can be useful, but also forgetting speaker information&lt;sup id=&quot;fnref:saon2017english:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:saon2017english&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:meng2018speaker:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:meng2018speaker&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;18&lt;/a&gt;&lt;/sup&gt; can be useful.&lt;/p&gt;

&lt;p&gt;To see why this may be the case, we can get a little help from latent variable theory. If we think of any speech recording as an observation generated by multiple underlying variables, we can name some of the variables which generated the audio, such as (1) the words that were said, (2) the speaker, (3) the environmental noise conditions, and (4) the acoustics of the recording location, among many others. These first four factors undoubtedly influence the acoustic properties of the observed recording. If we know that speaker characteristics and environmental noise had an influence on the audio, then we should either explicitly model them or try to remove them altogether. Both approaches show improvement over a baseline which does not model this extra information at all, but as discovered in Adi (2018)&lt;sup id=&quot;fnref:adi2018reverse&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:adi2018reverse&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;59&lt;/a&gt;&lt;/sup&gt;, if the dataset is large enough, the relative improvements are minor.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;multilingual-multi-task-asr&quot;&gt;Multilingual Multi-Task ASR&lt;/h2&gt;

&lt;p&gt;Multilingual Multi-Task ASR can be split into two main veins, depending on whether (1) the data from a source language is used or (2) a trained model from the source language is used. We find that the first approach is more common, and researchers will train Acoustic Models (or End-to-End models) with multiple, task-dependent output layers (i.e. one output layer for each language), and use the data from all languages in parameter estimation. These models typically share the input layer and all hidden layers among tasks, creating a kind of language-universal feature extractor. This approach has also been extended from multiple languages to multiple accents and dialects. The second vein of multilingual Multi-Task Learning involves using an Acoustic Model from one language as a teacher to train an Acoustic Model from some target language. Most commonly, this source model can be used to generate phoneme-like alignments on the target language data, which are in turn used as targets in an auxiliary task. More often than not, we find multilingual Multi-Task approaches used in a low-resource setting.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;multiple-languages-as-multiple-tasks&quot;&gt;Multiple Languages as Multiple Tasks&lt;/h3&gt;

&lt;p&gt;The earliest examples of Multi-Task Learning with multiple languages can be found in Huang (2013)&lt;sup id=&quot;fnref:huang2013:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;24&lt;/a&gt;&lt;/sup&gt; and Heigold (2013)&lt;sup id=&quot;fnref:heigold2013:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:heigold2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;25&lt;/a&gt;&lt;/sup&gt; (c.f. Figure (7)). These studies focused on improving performance on all languages found in the training set, not just one target language. Audio from every language was represented with the same acoustic features, and as such the neural networks required only one input layer. However, the network was trained to classify each language using language-specific phoneme targets. Taking this line of research into the world of End-to-End speech recognition, Dalmia (2018)&lt;sup id=&quot;fnref:dalmia2018&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:dalmia2018&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;60&lt;/a&gt;&lt;/sup&gt; showed that a CTC model trained on multiple languages and then fine-tuned to one specific language can outperform a model trained only on that language in a low-resource setting.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/huang-2013-dnn.png&quot; align=&quot;center&quot; style=&quot;width: 500px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 7&lt;/strong&gt;: Multilingual Multi-Task Acoustic Model Architecture from Huang et al. (2013)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
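
&lt;p&gt;The shared-hidden-layer architecture in Figure (7) can be sketched as follows (a generic PyTorch-style example with made-up phoneme inventory sizes, not the original implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class MultilingualAM(nn.Module):
    # Shared hidden layers with one softmax output layer per language; each
    # minibatch is routed to the output layer of the language it came from.
    def __init__(self, n_feats, n_hidden, phones_per_language):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # e.g. phones_per_language = {'english': 42, 'mandarin': 60}
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(n_hidden, n) for lang, n in phones_per_language.items()})

    def forward(self, x, language):
        h = self.shared(x)
        return self.heads[language](h)

# Training sketch: alternate minibatches across languages; only the shared
# layers and the sampled language head receive gradient updates.
# logits = model(feats, language='mandarin')
# loss = nn.functional.cross_entropy(logits, targets)
&lt;/code&gt;&lt;/pre&gt;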

&lt;p&gt;In a similar vein of work, instead of optimizing performance on all languages present in the training set, researchers have aimed to perform best on one particular target language. See Wang (2015)&lt;sup id=&quot;fnref:wang2015transfer&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:wang2015transfer&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;61&lt;/a&gt;&lt;/sup&gt; for a survey of advances in this area.&lt;/p&gt;

&lt;p&gt;Addressing the use-case where audio is available for a target language, but native-speaker transcribers are not easy to find, Do (2017)&lt;sup id=&quot;fnref:do2017multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:do2017multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;62&lt;/a&gt;&lt;/sup&gt; employed non-native speakers to transcribe a target language into nonsense words, according to how they perceived the language. Using these non-native transcriptions in addition to a small set of native-speaker transcripts, the authors trained a Multi-Task model to predict phonemes from both native and non-native transcripts. The intuition as to why this approach works is that non-native speakers will perceive sounds from a foreign language using their native phonemic system, and enough overlap should exist between the two languages to help train the acoustic model.&lt;/p&gt;

&lt;p&gt;In the relatively new field of spoken language translation, where speech from one language is mapped directly to a text translation in a second language, these researchers&lt;sup id=&quot;fnref:weiss2017sequence&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:weiss2017sequence&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;63&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:anastasopoulos2018tied&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:anastasopoulos2018tied&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;64&lt;/a&gt;&lt;/sup&gt; created multiple auxiliary tasks by either recognizing the speech of the source language (i.e. standard ASR), or by translating the source language into one or more different languages.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;multiple-accents-as-multiple-tasks&quot;&gt;Multiple Accents as Multiple Tasks&lt;/h3&gt;

&lt;p&gt;In a vein of research which belongs somewhere between monolingual and multilingual speech recognition, the authors in &lt;sup id=&quot;fnref:yang2018joint:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:yang2018joint&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;30&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:rao2017multi:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:rao2017multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;31&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:jain2018improved:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:jain2018improved&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;32&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:sun2018domain:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:sun2018domain&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt; used Multi-Task Learning to perform multi-accent speech recognition. The researchers in Yang (2018)&lt;sup id=&quot;fnref:yang2018joint:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:yang2018joint&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;30&lt;/a&gt;&lt;/sup&gt; trained a model to recognize English, with separate output layers for British English vs. American English. These two tasks were trained in parallel with a third task, accent identification. Combining all three tasks led to the best results (c.f. Figure (8)). The authors in Rao (2017)&lt;sup id=&quot;fnref:rao2017multi:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:rao2017multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;31&lt;/a&gt;&lt;/sup&gt; recognized phonemes of different English accents at an intermediate hidden layer, and then accent-agnostic graphemes at the output layer.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;img src=&quot;/misc/figs/yang-2018.png&quot; align=&quot;center&quot; style=&quot;width: 500px;&quot; /&gt;&lt;/center&gt;
&lt;center&gt;&lt;strong&gt;Figure 8&lt;/strong&gt;: Multi-Accent Deep Neural Network Architecture from Yang et al. (2018)&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;multilingual-model-as-feature-extractor&quot;&gt;Multilingual Model as Feature Extractor&lt;/h3&gt;

&lt;p&gt;So-called &lt;em&gt;bottleneck&lt;/em&gt; features, which often incorporate Multi-Task Learning, have also been developed to aid low-resource acoustic modeling. These bottleneck features are activations from a condensed hidden layer in a multilingual acoustic model. First a multilingual Acoustic Model is trained, then data from a new language is passed through this DNN, and the bottleneck activations are appended as additional features to the original audio&lt;sup id=&quot;fnref:dupont2005feature:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:dupont2005feature&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;33&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:cui2015multilingual:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:cui2015multilingual&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;34&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:grezl2014adaptation:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:grezl2014adaptation&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;35&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:knill2013investigation:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:knill2013investigation&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;36&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:vu2014multilingual:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:vu2014multilingual&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;37&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:xu2015comparative:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:xu2015comparative&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;38&lt;/a&gt;&lt;/sup&gt;. In this way, a kind of universal feature extractor is trained on a large multilingual dataset. The bottleneck features themselves are the product of Multi-Task Learning.&lt;/p&gt;
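
&lt;p&gt;A minimal sketch of the bottleneck pipeline (illustrative PyTorch-style code; the layer sizes and helper names are assumptions, not any specific published recipe): train the multilingual network, then keep the layers up to the condensed layer and concatenate its activations to the original frames for the new language.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    # Multilingual DNN with a condensed (bottleneck) hidden layer. After
    # multilingual training, the bottleneck activations serve as extra features.
    def __init__(self, n_feats, n_hidden, n_bottleneck, n_targets):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_feats, n_hidden), nn.ReLU())
        self.bottleneck = nn.Linear(n_hidden, n_bottleneck)
        self.post = nn.Sequential(nn.ReLU(), nn.Linear(n_bottleneck, n_targets))

    def forward(self, x):
        return self.post(self.bottleneck(self.pre(x)))

    def extract(self, x):
        # Activations of the condensed layer, used as features for a new language.
        return self.bottleneck(self.pre(x))

def append_bottleneck_feats(extractor, frames):
    # Concatenate the original frames with their bottleneck representation.
    with torch.no_grad():
        bn = extractor.extract(frames)
    return torch.cat([frames, bn], dim=-1)
&lt;/code&gt;&lt;/pre&gt;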

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;source-language-model-as-teacher&quot;&gt;Source Language Model as Teacher&lt;/h3&gt;

&lt;p&gt;Instead of using data from multiple languages, it is possible to use the predictions from a pre-trained source model as targets in an auxiliary task. In this way, knowledge located in the source dataset is transferred indirectly via a source model, as opposed to &lt;em&gt;directly&lt;/em&gt; from the dataset itself.&lt;/p&gt;

&lt;p&gt;In a very recent approach, He (2018)&lt;sup id=&quot;fnref:he2018a:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:he2018a&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;39&lt;/a&gt;&lt;/sup&gt; trained a classifier on a well-resourced language to identify acoustic landmarks, and then used that well-resourced model to identify acoustic landmarks in a low-resourced language. Those newly discovered acoustic landmarks were then used as targets in an auxiliary task. This approach can be thought of as a kind of Multi-Task Student-Teacher (c.f. Wong (2016)&lt;sup id=&quot;fnref:wong2016sequence&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:wong2016sequence&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;65&lt;/a&gt;&lt;/sup&gt;) approach, where we “distill” (c.f. Hinton (2015)&lt;sup id=&quot;fnref:hinton2015&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:hinton2015&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;66&lt;/a&gt;&lt;/sup&gt;) knowledge from one (larger) model to another via an auxiliary task.&lt;/p&gt;
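
&lt;p&gt;Framed generically, a student-teacher auxiliary objective of this kind combines hard labels in the target language with the soft predictions of a source-language teacher. A minimal sketch (not the exact landmark-based setup from He (2018); the loss weight and temperature are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

def student_mtl_loss(main_logits, aux_logits, main_targets, teacher_probs,
                     aux_weight=0.5, temperature=2.0):
    # Main task: hard labels in the low-resource target language.
    # Auxiliary task: match the soft posteriors of a source-language teacher
    # model (knowledge distillation), here via a KL-divergence term.
    ce = nn.functional.cross_entropy(main_logits, main_targets)
    student_logp = nn.functional.log_softmax(aux_logits / temperature, dim=-1)
    # teacher_probs are assumed to be softened with the same temperature.
    kl = nn.functional.kl_div(student_logp, teacher_probs, reduction='batchmean')
    return ce + aux_weight * kl
&lt;/code&gt;&lt;/pre&gt;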

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;discussion-of-multilingual-multi-task-learning&quot;&gt;Discussion of Multilingual Multi-Task Learning&lt;/h3&gt;

&lt;p&gt;Surveying the literature on Multi-Task Learning in multilingual speech recognition, we can identify some common findings among studies. Firstly, we observe positive effects from pooling as many languages as possible, even when the languages are completely unrelated. This pooling of unrelated languages may seem strange at first: why should languages as different as English and Mandarin have anything in common? However, abstracting away from the linguistic peculiarities of each language, all languages share some common traits.&lt;/p&gt;

&lt;p&gt;All spoken languages are produced with human lungs, human mouths, and human vocal tracts. This means that all languages are produced in an acoustic space constrained by the anatomy of the human body. If we can bias a model to search for relevant patterns only within this constrained space, then we should expect the model to learn useful representations faster. Likewise, the model should be less likely to learn irrelevant correlated information about environmental noise which occurs outside this humanly-producible acoustic space. This is one intuition as to why the combination of unrelated languages is helpful: any extra language will add inductive bias for the relevant search space of human speech sounds.&lt;/p&gt;

&lt;p&gt;Nevertheless, these studies do show a tendency for closely related languages to help each other more than unrelated languages. Both Huang (2013)&lt;sup id=&quot;fnref:huang2013:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;24&lt;/a&gt;&lt;/sup&gt; and Dalmia (2018)&lt;sup id=&quot;fnref:dalmia2018:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:dalmia2018&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;60&lt;/a&gt;&lt;/sup&gt; concluded that improvement was greater when the languages were more similar. However, they still found that phonetically distinct languages were able to transfer useful bias to each other when a large enough dataset was used for the source language. With regards to how much Multi-Task Learning helps relative to the size of the target language dataset, the authors in Heigold (2013)&lt;sup id=&quot;fnref:heigold2013:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:heigold2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;25&lt;/a&gt;&lt;/sup&gt; and Huang (2013)&lt;sup id=&quot;fnref:huang2013:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:huang2013&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;24&lt;/a&gt;&lt;/sup&gt; saw larger relative improvements when the dataset for the target language was smaller (but they still observed improvements on large datasets).&lt;/p&gt;

&lt;p&gt;In conclusion, Multi-Task Learning for multilingual acoustic modeling yields the largest improvements when: (1) the dataset for the target language is small, (2) the auxiliary language is closely related, and (3) the auxiliary language dataset is large.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;multi-task-learning-in-other-speech-technologies&quot;&gt;Multi-Task Learning in Other Speech Technologies&lt;/h2&gt;

&lt;p&gt;In addition to Automatic Speech Recognition, Multi-Task Learning has found its way into other speech technologies. The use of Multi-Task Learning is less established in the following speech technology fields, and as such we find a very interesting mix of different applications and approaches.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;speech-synthesis&quot;&gt;Speech Synthesis&lt;/h3&gt;

&lt;p&gt;Working on speech synthesis, the research team in Hu (2015)&lt;sup id=&quot;fnref:hu2015fusion&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:hu2015fusion&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;67&lt;/a&gt;&lt;/sup&gt; used Multi-Task Learning to train neural speech synthesis systems for a single language. These models predicted both the acoustic features (spectral envelope) and the log amplitude of the output speech. Additionally, these researchers recombined the outputs of both tasks to improve the quality of the final synthesized speech. In a similar vein, the authors in Wu (2015)&lt;sup id=&quot;fnref:wu2015deep&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:wu2015deep&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;68&lt;/a&gt;&lt;/sup&gt; employed Multi-Task Learning of vocoder parameters and a perceptual representation of speech (along with bottleneck features) to train a deep neural network for speech synthesis. Working with input features which are not speech or text, but rather ultrasound images of tongue contours, the authors in Toth (2018)&lt;sup id=&quot;fnref:tothmulti&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tothmulti&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;69&lt;/a&gt;&lt;/sup&gt; trained a model to perform both phoneme classification and regression on the spectral parameters of a vocoder, leading to better performance on both tasks. Recently, in their work on modeling the raw audio waveform, the authors in Gu (2018)&lt;sup id=&quot;fnref:gu2018multi&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:gu2018multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;70&lt;/a&gt;&lt;/sup&gt; trained a WaveNet model to predict frame-level vocoder features as a secondary task.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;speech-emotion-recognition&quot;&gt;Speech Emotion Recognition&lt;/h3&gt;

&lt;p&gt;Working on emotion recognition from speech, Parthasarathy (2017)&lt;sup id=&quot;fnref:parthasarathy2017&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:parthasarathy2017&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;71&lt;/a&gt;&lt;/sup&gt; demonstrated that a model can be trained to identify multiple (believed to be orthogonal) emotions as separate tasks. The authors in Le (2017)&lt;sup id=&quot;fnref:le2017&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:le2017&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;72&lt;/a&gt;&lt;/sup&gt; took an emotion recognition task which is typically framed as a regression, and discovered new, discrete classes (via k-means clustering) to use as targets in auxiliary tasks. Using classification of “gender” and “naturalness” as auxiliary tasks, Kim (2017)&lt;sup id=&quot;fnref:kim2017&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:kim2017&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;73&lt;/a&gt;&lt;/sup&gt; also found improvements in spoken emotion recognition via Multi-Task training. Recently, the authors in Lotfian (2018)&lt;sup id=&quot;fnref:lotfian2018&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:lotfian2018&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;74&lt;/a&gt;&lt;/sup&gt; trained a model to predict the first and second most salient emotions perceived by the evaluator.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;speaker-verification&quot;&gt;Speaker Verification&lt;/h3&gt;

&lt;p&gt;With regards to speaker verification, the authors in Liu (2018)&lt;sup id=&quot;fnref:liu2018&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:liu2018&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;75&lt;/a&gt;&lt;/sup&gt; used phoneme classification as an auxiliary task, and the authors in Ding (2018)&lt;sup id=&quot;fnref:ding2018mtgan&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:ding2018mtgan&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;76&lt;/a&gt;&lt;/sup&gt; trained speaker embeddings by jointly optimizing (1) a GAN to distinguish speech from non-speech and (2) a speaker classifier.&lt;/p&gt;

&lt;p&gt;Combining both speech recognition and speaker verification, Chen (2015)&lt;sup id=&quot;fnref:chen2015multi:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:chen2015multi&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;58&lt;/a&gt;&lt;/sup&gt; trained an Acoustic Model to perform both tasks and found improvement. In an adversarial framework, Wang (2018)&lt;sup id=&quot;fnref:wang2018&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:wang2018&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;77&lt;/a&gt;&lt;/sup&gt; taught their model to forget the differences between domains in parallel with identifying speakers.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;miscellaneous-speech-applications&quot;&gt;Miscellaneous Speech Applications&lt;/h3&gt;

&lt;p&gt;Extending the work from Multi-Task speech recognition to Key-Word Spotting, the researchers in Panchapagesan (2016)&lt;sup id=&quot;fnref:pan&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:pan&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;78&lt;/a&gt;&lt;/sup&gt; combined parameter-copying and MTL. They first took an Acoustic Model from a large-vocabulary English recognition task, re-initialized the weights immediately preceding the output layer, and retrained with two output layers. One layer predicted only the phonemes in the Key-Word of interest, and the other layer predicted senones from the large-vocabulary task.&lt;/p&gt;

&lt;p&gt;To predict turn-taking behavior in conversation, the authors in Hara (2018)&lt;sup id=&quot;fnref:hara2018&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:hara2018&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;79&lt;/a&gt;&lt;/sup&gt; trained a model to jointly predict backchannelling and use of filler words.&lt;/p&gt;

&lt;p&gt;Predicting the severity of speech impairment (i.e. dysarthria) in the speech of patients with Parkinson’s disease, the researchers in Vasquez (2018)&lt;sup id=&quot;fnref:vasq&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:vasq&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;80&lt;/a&gt;&lt;/sup&gt; trained a model to predict the level of impairment in various articulators (e.g. lips, tongue, larynx, etc.) as multiple tasks.&lt;/p&gt;

&lt;p&gt;The researchers in Xu (2018)&lt;sup id=&quot;fnref:xu&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:xu&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;81&lt;/a&gt;&lt;/sup&gt; trained a model to separate speech (from a multi-speaker monaural signal) while performing an auxiliary task of classifying every audio frame as single-speaker vs. multi-speaker vs. no-speaker.&lt;/p&gt;

&lt;p&gt;The researchers in He (2018)&lt;sup id=&quot;fnref:he2018b&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:he2018b&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;82&lt;/a&gt;&lt;/sup&gt; trained a model to both localize speech sources and classify incoming audio as speech vs. non-speech.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this survey we touched on the main veins of Multi-Task work in speech recognition (as well as other speech technologies). With regards to speech recognition, we identified multilingual and monolingual trends. Multilingual approaches exploit bias from a source language by using a source dataset or a pre-trained source model. Monolingual approaches use targets either at the acoustic frame level or at the recording level. All approaches involve the updating of task-dependent and task-independent parameters. We find it is often the case that Multi-Task Learning is applied to low-resource scenarios, where bias from related tasks can be crucial for successful model training.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;references--footnotes&quot;&gt;References &amp;amp; Footnotes&lt;/h2&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For a more complete overview of Multi-Task Learning itself: &lt;a href=&quot;https://ruder.io/multi-task/&quot;&gt;Ruder (2017)&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;These two kinds of targets are the same with regards to training neural networks via backpropagation. The targets for classification are just a special case of regression targets, where the values in the vector are \(1.0\) or \(0.0\). &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;An exception to this is Multi-Task Adversarial Learning, in which performance on an auxiliary task is encouraged to be as poor as possible. In domain adaptation, an example of this may be forcing a neural net to be blind to the difference between domains. The Adversarial auxiliary task would be classification of &lt;strong&gt;domain-type&lt;/strong&gt;, and the weights would be updated in a way that increases error as much as possible. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:caruana1998multitask&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Caruana (1998): Multitask learning &lt;a href=&quot;#fnref:caruana1998multitask&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:caruana1996algorithms&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Caruana (1996): Algorithms and applications for Multitask Learning &lt;a href=&quot;#fnref:caruana1996algorithms&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is an example of using domain or expert knowledge to create a new task, where the expert knowledge is contained in the encyclopedia. One could also hire a dog expert to label the images manually. Either way, we are exploiting some source of domain-specific knowledge (i.e. knowledge of the physiology of different dog breeds). &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In the traditional, Hybrid ASR approach (i.e. DNN Acoustic Model + N-gram language model), there’s not a lot of room to use MTL when training the language model or the decoder. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:bell2015&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Bell (2015): Regularization of context-dependent deep neural networks with context-independent multi-task training &lt;a href=&quot;#fnref:bell2015&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:bell2015:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:bell2015:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:seltzer2013&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seltzer (2013): Multi-task learning in deep neural networks for improved phoneme recognition &lt;a href=&quot;#fnref:seltzer2013&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:seltzer2013:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:seltzer2013:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:huang2015&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Huang (2015): Rapid adaptation for deep neural networks through multi-task learning &lt;a href=&quot;#fnref:huang2015&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:huang2015:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:huang2015:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:chen2014&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chen (2014): Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition &lt;a href=&quot;#fnref:chen2014&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2014:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2014:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:chen2015&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chen (2015): Multitask Learning of Deep Neural Networks for Low-resource Speech Recognition &lt;a href=&quot;#fnref:chen2015&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:toshniwal2017multitask&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Toshniwal (2017): Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition &lt;a href=&quot;#fnref:toshniwal2017multitask&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:toshniwal2017multitask:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:shinohara2016adversarial&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Shinohara (2016): Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition &lt;a href=&quot;#fnref:shinohara2016adversarial&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:shinohara2016adversarial:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:shinohara2016adversarial:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:serdyuk2016invariant&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Serdyuk (2016): Invariant representations for noisy speech recognition &lt;a href=&quot;#fnref:serdyuk2016invariant&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:serdyuk2016invariant:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:serdyuk2016invariant:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tripathi2018adversarial&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Tripathi (2018): Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition &lt;a href=&quot;#fnref:tripathi2018adversarial&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:tripathi2018adversarial:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:tripathi2018adversarial:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:saon2017english&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Saon (2017): English conversational telephone speech recognition by humans and machines &lt;a href=&quot;#fnref:saon2017english&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:saon2017english:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:saon2017english:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:saon2017english:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:meng2018speaker&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Meng (2018): Speaker-invariant training via adversarial learning &lt;a href=&quot;#fnref:meng2018speaker&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:meng2018speaker:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:meng2018speaker:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:meng2018speaker:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:sun2018domain&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sun (2018): Domain Adversarial Training for Accented Speech Recognition &lt;a href=&quot;#fnref:sun2018domain&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:sun2018domain:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:sun2018domain:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:sun2018domain:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:sun2018domain:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:parveen2003multitask&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Parveen (2003): Multitask learning in connectionist robust ASR using recurrent neural networks &lt;a href=&quot;#fnref:parveen2003multitask&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:parveen2003multitask:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:parveen2003multitask:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:giri2015improving&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Giri (2015): Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning &lt;a href=&quot;#fnref:giri2015improving&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:giri2015improving:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:giri2015improving:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:chen2015speech&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chen (2015): Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks &lt;a href=&quot;#fnref:chen2015speech&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015speech:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015speech:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:zhang2017attention&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Zhang (2017): Attention-based LSTM with Multi-task Learning for Distant Speech Recognition &lt;a href=&quot;#fnref:zhang2017attention&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:zhang2017attention:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:zhang2017attention:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:huang2013&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Huang (2013): Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers &lt;a href=&quot;#fnref:huang2013&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:huang2013:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:huang2013:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:huang2013:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:heigold2013&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Heigold (2013): Multilingual acoustic models using distributed deep neural networks &lt;a href=&quot;#fnref:heigold2013&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:heigold2013:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:heigold2013:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tuske2014multilingual&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Tuske (2014): Multilingual MRASTA features for low-resource keyword search and speech recognition systems &lt;a href=&quot;#fnref:tuske2014multilingual&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:mohan2015multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Mohan (2015): Multi-lingual speech recognition with low-rank multi-task deep neural networks &lt;a href=&quot;#fnref:mohan2015multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:grezl2016&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Grezl (2016): Boosting performance on low-resource languages by standard corpora: An analysis &lt;a href=&quot;#fnref:grezl2016&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:matassoni2018non&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Matassoni (2018): Non-Native Children Speech Recognition Through Transfer Learning &lt;a href=&quot;#fnref:matassoni2018non&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:yang2018joint&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Yang (2018): Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition &lt;a href=&quot;#fnref:yang2018joint&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:yang2018joint:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:yang2018joint:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:rao2017multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Rao (2017): Multi-accent speech recognition with hierarchical grapheme based models &lt;a href=&quot;#fnref:rao2017multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:rao2017multi:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:rao2017multi:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:jain2018improved&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jain (2018): Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning &lt;a href=&quot;#fnref:jain2018improved&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:jain2018improved:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:dupont2005feature&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Dupont (2005): Feature extraction and acoustic modeling: an approach for improved generalization across languages and accents &lt;a href=&quot;#fnref:dupont2005feature&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:dupont2005feature:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:cui2015multilingual&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Cui (2015): Multilingual representations for low resource speech recognition and keyword search &lt;a href=&quot;#fnref:cui2015multilingual&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:cui2015multilingual:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:grezl2014adaptation&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Grezl (2014): Adaptation of multilingual stacked bottle-neck neural network structure for new language &lt;a href=&quot;#fnref:grezl2014adaptation&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:grezl2014adaptation:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:knill2013investigation&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Knill (2013): Investigation of multilingual deep neural networks for spoken term detection &lt;a href=&quot;#fnref:knill2013investigation&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:knill2013investigation:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:vu2014multilingual&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Vu (2014): Multilingual deep neural network based acoustic modeling for rapid language adaptation &lt;a href=&quot;#fnref:vu2014multilingual&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:vu2014multilingual:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:xu2015comparative&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Xu (2015): A comparative study of BNF and DNN multilingual training on cross-lingual low-resource speech recognition &lt;a href=&quot;#fnref:xu2015comparative&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:xu2015comparative:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:he2018a&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;He (2018): Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks &lt;a href=&quot;#fnref:he2018a&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:he2018a:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:stadermann2005multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Stadermann (2005): Multi-task learning strategies for a recurrent neural net in a hybrid tied-posteriors acoustic model &lt;a href=&quot;#fnref:stadermann2005multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:stadermann2005multi:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:arora2017phonological&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Arora (2017): Phonological feature based mispronunciation detection and diagnosis using multi-task DNNs and active learning &lt;a href=&quot;#fnref:arora2017phonological&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:chen2015diss&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chen (2015): Multi-task Learning Deep Neural Networks for Automatic Speech Recognition &lt;a href=&quot;#fnref:chen2015diss&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015diss:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015diss:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:bell2015complementary&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Bell (2015): Complementary tasks for context-dependent deep neural network acoustic models &lt;a href=&quot;#fnref:bell2015complementary&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:bell2015complementary:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:bell2015complementary:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:swietojanski2015structured&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Swietojanski (2015): Structured output layer with auxiliary targets for context-dependent acoustic modelling &lt;a href=&quot;#fnref:swietojanski2015structured&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:swietojanski2015structured:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:badino2016phonetic&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Badino (2016): Phonetic Context Embeddings for DNN-HMM Phone Recognition &lt;a href=&quot;#fnref:badino2016phonetic&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:badino2016phonetic:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:pironkov2016multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Pironkov (2016): Multi-task learning for speech recognition: an overview &lt;a href=&quot;#fnref:pironkov2016multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:pironkov2016multi:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fernandez2007sequence&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Fernandez (2007): Sequence labelling in structured domains with hierarchical recurrent neural networks &lt;a href=&quot;#fnref:fernandez2007sequence&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:sanabria2018hierarchical&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sanabria (2018): Hierarchical Multi-Task Learning With CTC &lt;a href=&quot;#fnref:sanabria2018hierarchical&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:sanabria2018hierarchical:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:krishna2018hierarchical&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Krishna (2018): Hierarchical Multitask Learning for CTC-based Speech Recognition &lt;a href=&quot;#fnref:krishna2018hierarchical&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:moriya2018multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Moriya (2018): Multi-task Learning with Augmentation Strategy for Acoustic-to-word Attention-based Encoder-decoder Speech Recognition &lt;a href=&quot;#fnref:moriya2018multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:das2017deep&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Das (2017): Deep Auto-encoder Based Multi-task Learning Using Probabilistic Transcriptions &lt;a href=&quot;#fnref:das2017deep&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:lu2004multitask&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Lu (2004): Multitask learning in connectionist speech recognition &lt;a href=&quot;#fnref:lu2004multitask&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:lu2004multitask:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:kim2017joint&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Kim (2017): Joint CTC-attention based end-to-end speech recognition using Multi-Task Learning &lt;a href=&quot;#fnref:kim2017joint&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:lu2017multitask&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Lu (2017): Multitask Learning with CTC and Segmental CRF for Speech Recognition &lt;a href=&quot;#fnref:lu2017multitask&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:povey2016purely&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Povey (2016): Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI &lt;a href=&quot;#fnref:povey2016purely&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:qin2018automatic&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Qin (2018): Automatic Speech Assessment for People with Aphasia Using TDNN-BLSTM with Multi-Task Learning &lt;a href=&quot;#fnref:qin2018automatic&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tong2017multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Tong (2017): Multi-Task Learning for Mispronunciation Detection on Singapore Children’s Mandarin Speech &lt;a href=&quot;#fnref:tong2017multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:chen2015multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chen (2015): Multi-task learning for text-dependent speaker verification &lt;a href=&quot;#fnref:chen2015multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:chen2015multi:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:adi2018reverse&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Adi (2018): To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition &lt;a href=&quot;#fnref:adi2018reverse&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:dalmia2018&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Dalmia (2018): Sequence-based Multi-lingual Low Resource Speech Recognition &lt;a href=&quot;#fnref:dalmia2018&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:dalmia2018:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:wang2015transfer&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Wang (2015): Transfer learning for speech and language processing &lt;a href=&quot;#fnref:wang2015transfer&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:do2017multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Do (2017): Multi-task learning using mismatched transcription for under-resourced speech recognition &lt;a href=&quot;#fnref:do2017multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:weiss2017sequence&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Weiss (2017): Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech &lt;a href=&quot;#fnref:weiss2017sequence&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:anastasopoulos2018tied&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Anastasopoulos (2018): Tied Multitask Learning for Neural Speech Translation &lt;a href=&quot;#fnref:anastasopoulos2018tied&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:wong2016sequence&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Wong (2016): Sequence student-teacher training of deep neural networks &lt;a href=&quot;#fnref:wong2016sequence&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:hinton2015&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Hinton (2015): Distilling the knowledge in a neural network &lt;a href=&quot;#fnref:hinton2015&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:hu2015fusion&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Hu (2015): Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning &lt;a href=&quot;#fnref:hu2015fusion&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:wu2015deep&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Wu (2015): Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis &lt;a href=&quot;#fnref:wu2015deep&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tothmulti&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Toth (2018): Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces &lt;a href=&quot;#fnref:tothmulti&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:gu2018multi&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Gu (2018): Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions &lt;a href=&quot;#fnref:gu2018multi&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:parthasarathy2017&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Parthasarathy (2017): Jointly predicting arousal, valence and dominance with multi-task learning &lt;a href=&quot;#fnref:parthasarathy2017&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:le2017&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Le (2017): Discretized continuous speech emotion recognition with multi-task deep recurrent neural network &lt;a href=&quot;#fnref:le2017&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:kim2017&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Kim (2017): Towards Speech Emotion Recognition in the wild using Aggregated Corpora and Deep Multi-Task Learning &lt;a href=&quot;#fnref:kim2017&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:lotfian2018&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Lotfian (2018): Predicting Categorical Emotions by Jointly Learning Primary and Secondary Emotions through Multitask Learning &lt;a href=&quot;#fnref:lotfian2018&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:liu2018&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Liu (2018): Speaker Embedding Extraction with Phonetic Information &lt;a href=&quot;#fnref:liu2018&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:ding2018mtgan&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Ding (2018): MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks &lt;a href=&quot;#fnref:ding2018mtgan&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:wang2018&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Wang (2018): Unsupervised domain adaptation via domain adversarial training for speaker recognition &lt;a href=&quot;#fnref:wang2018&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:pan&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Panchapagesan (2016): Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting &lt;a href=&quot;#fnref:pan&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:hara2018&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Hara (2018): Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers &lt;a href=&quot;#fnref:hara2018&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:vasq&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Vasquez (2018): A Multitask Learning Approach to Assess the Dysarthria Severity in Patients with Parkinson’s Disease &lt;a href=&quot;#fnref:vasq&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:xu&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Xu (2018): A Shifted Delta Coefficient Objective for Monaural Speech Separation Using Multi-task Learning &lt;a href=&quot;#fnref:xu&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:he2018b&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;He (2018): Joint Localization and Classification of Multiple Sound Sources Using a Multi-task Neural Network &lt;a href=&quot;#fnref:he2018b&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 21 Mar 2020 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/asr/2020/03/21/overview-mtl-in-asr.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/asr/2020/03/21/overview-mtl-in-asr.html</guid>
        
        
        <category>ASR</category>
        
      </item>
    
      <item>
        <title>My INTERSPEECH Schedule</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;In what follows, you can find my tentative schedule for each day. There’s a ton of cool stuff going on at this year’s Interspeech, so I took the following approach:&lt;/p&gt;

&lt;p&gt;First, I decided I’m going to focus on core end-to-end TTS and ASR technology, with a preference for TTS in the case of a tie.&lt;/p&gt;

&lt;p&gt;Then, I went through the &lt;a href=&quot;https://interspeech2019.org/program/schedule&quot;&gt;schedule&lt;/a&gt; and put a black box around all the sessions that were most interesting to me.&lt;/p&gt;

&lt;p&gt;Finally, I took one last pass over those sessions and put stars next to the ones I definitely don’t want to miss.&lt;/p&gt;

&lt;h1 id=&quot;sunday&quot;&gt;Sunday&lt;/h1&gt;

&lt;object data=&quot;/misc/1.pdf&quot; type=&quot;application/pdf&quot; width=&quot;100%&quot; height=&quot;700px&quot; style=&quot;max-width: 700px;&quot;&gt;
    &lt;embed src=&quot;/misc/1.pdf&quot; /&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&quot;/misc/1.pdf&quot;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
&lt;/object&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;monday&quot;&gt;Monday&lt;/h1&gt;

&lt;object data=&quot;/misc/2.pdf&quot; type=&quot;application/pdf&quot; width=&quot;100%&quot; height=&quot;700px&quot; style=&quot;max-width: 700px;&quot;&gt;
    &lt;embed src=&quot;/misc/2.pdf&quot; /&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&quot;/misc/2.pdf&quot;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
&lt;/object&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;tuesday&quot;&gt;Tuesday&lt;/h1&gt;

&lt;object data=&quot;/misc/3.pdf&quot; type=&quot;application/pdf&quot; width=&quot;100%&quot; height=&quot;700px&quot; style=&quot;max-width: 700px;&quot;&gt;
    &lt;embed src=&quot;/misc/3.pdf&quot; /&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&quot;/misc/3.pdf&quot;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
&lt;/object&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;wednesday&quot;&gt;Wednesday&lt;/h1&gt;

&lt;object data=&quot;/misc/4.pdf&quot; type=&quot;application/pdf&quot; width=&quot;100%&quot; height=&quot;700px&quot; style=&quot;max-width: 700px;&quot;&gt;
    &lt;embed src=&quot;/misc/4.pdf&quot; /&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&quot;/misc/4.pdf&quot;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
&lt;/object&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;thursday&quot;&gt;Thursday&lt;/h1&gt;

&lt;object data=&quot;/misc/5.pdf&quot; type=&quot;application/pdf&quot; width=&quot;100%&quot; height=&quot;700px&quot; style=&quot;max-width: 700px;&quot;&gt;
    &lt;embed src=&quot;/misc/5.pdf&quot; /&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&quot;/misc/5.pdf&quot;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
&lt;/object&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

</description>
        <pubDate>Sat, 17 Aug 2019 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/asr/2019/08/17/INTERSPEECH-schedule.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/asr/2019/08/17/INTERSPEECH-schedule.html</guid>
        
        
        <category>ASR</category>
        
      </item>
    
      <item>
        <title>Kaldi Troubleshooting Head-to-Toe</title>
        <description>&lt;p&gt;&lt;img src=&quot;/misc/kaldi-troubleshooting.png&quot; align=&quot;right&quot; style=&quot;width: 300px;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The following guide is for those who have already &lt;a href=&quot;http://jrmeyer.github.io/kaldi/2016/01/26/Installing-Kaldi.html&quot;&gt;installed Kaldi&lt;/a&gt;, trained a GMM model, and &lt;a href=&quot;http://jrmeyer.github.io/kaldi/2016/12/15/DNN-AM-Kaldi.html&quot;&gt;trained a DNN model&lt;/a&gt;, but the final system isn’t performing well.&lt;/p&gt;

&lt;p&gt;If you’re looking to get started with Kaldi, feel free to click on either of the above links and then come back to this guide as needed.&lt;/p&gt;

&lt;p&gt;If you’re looking for a quick answer to a hyperparameter setting, check out this &lt;a href=&quot;http://jrmeyer.github.io/asr/2019/08/17/Kaldi-cheatsheet.html&quot;&gt;Kaldi Cheatsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’d like a simple, easy to understand Kaldi recipe, you can check out the &lt;a href=&quot;https://github.com/JRMeyer/easy-kaldi&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;easy-kaldi&lt;/code&gt; GitHub repo&lt;/a&gt;. You probably won’t get state-of-the-art results with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;easy-kaldi&lt;/code&gt;, but you will hopefully be able to understand the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;big-picture&quot;&gt;Big Picture&lt;/h2&gt;

&lt;p&gt;The typical Kaldi training pipeline consists of the following four steps:&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;table align=&quot;center&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Step&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;em&gt;Dependencies&lt;/em&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Train Monophones&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;pairs of &lt;em&gt;&amp;lt;utterance, transcript&amp;gt;&lt;/em&gt; training data&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Train Triphones&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;Monophone alignments&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Train Speaker Adaptations&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;Triphone alignments&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Train Deep Neural Network&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;Triphone + Speaker Adaptation alignments&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;The first three steps all involve Gaussian Mixture Models and Hidden Markov Models (GMM-HMMs). So, even if you only care about the Deep Neural Network (DNN), you can’t avoid GMMs.&lt;/p&gt;

&lt;p&gt;More importantly, you can’t just train those GMMs, generate the alignments for the deep neural network, and spend all your time tweaking neural net parameters in the hopes of getting a great model. You need to take the time to ensure that your GMMs are as high-performing as possible, or else all the parameter tweaking in the world won’t save your neural net. The GMMs are the foundation upon which your neural network is built, and you need to make sure you have a strong foundation. Otherwise you’re just building a house on sand.&lt;/p&gt;

&lt;p&gt;The following document is a walk-through and troubleshooting guide for the entire training pipeline in Kaldi.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;problem-statement&quot;&gt;Problem Statement&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Problem: You have a trained Kaldi system, but it performs poorly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once you have a working Automatic Speech Recognition (ASR) system built on Kaldi, your next step will be to improve that system’s performance. To be clear, by “ASR system”, I’m referring to the combination of the Acoustic Model and the Language Model.&lt;/p&gt;

&lt;p&gt;Word Error Rate (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;) is the metric we most often use when evaluating a system’s performance, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; must be interpreted as the combined performance of those two parts (Acoustic Model and Language Model) – remember that. To improve &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; as much as possible, you may need to address issues in &lt;em&gt;both&lt;/em&gt; models. Nevertheless, isolated improvements to either model should lead to improvements in overall &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; is the most important metric to optimize, but in the following we will focus on other metrics and data which reflect only the Acoustic Model. We will troubleshoot starting from the last step of Kaldi training (i.e. the DNN), and work our way backwards to the first step (i.e. the Monophones).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;troubleshooting-the-neural-network-acoustic-model&quot;&gt;Troubleshooting the Neural Network Acoustic Model&lt;/h2&gt;

&lt;p&gt;The first thing we need to do is identify the source of the problem: the Acoustic Model or the Language Model. It’s hard to troubleshoot the Language Model on its own, so we will start with the neural Acoustic Model. If after following this guide you conclude that the Acoustic Model is performing fine, then you should spend time on the Language Model (e.g. train on new text data, train higher-order N-grams, etc.).&lt;/p&gt;
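
&lt;p&gt;As a rough illustration of what that Language Model work can look like, here is a minimal sketch that assumes SRILM is on your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PATH&lt;/code&gt; and uses placeholder paths (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;corpus.txt&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/lang/words.txt&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/lang_test/G.fst&lt;/code&gt;) that you would swap for your own:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Sketch: train a 3-gram LM with Kneser-Ney smoothing, then convert the
# ARPA file into the G.fst that Kaldi decodes with. Paths are placeholders.
ngram-count -order 3 -kndiscount -interpolate \
            -text corpus.txt -lm lm.arpa

arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt \
         lm.arpa data/lang_test/G.fst&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;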

&lt;p&gt;The Acoustic Model sends its output (i.e. phoneme predictions) to the Language Model, and then the Language Model tries to translate those predictions into words. A junk Acoustic Model will send junk predictions down the pipeline to the Language Model, and in the end you’ll get junk output from the Language Model.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;frame-classification-accuracy&quot;&gt;Frame-Classification Accuracy&lt;/h3&gt;

&lt;p&gt;As mentioned above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; is the most important metric for overall system performance. However, with regards to the neural Acoustic Model, &lt;strong&gt;frame-classification accuracy&lt;/strong&gt; is the most relevant metric you can optimize. This metric tells you how well the Acoustic Model is able to assign a class label (i.e. phoneme ID) to a new slice of audio (i.e. audio frame). The Acoustic Model exists solely to perform this one task, and if the Acoustic Model is performing this task poorly the overall &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; will be high.&lt;/p&gt;

&lt;p&gt;To find data on frame-classification accuracy, we need to look at the relevant Kaldi log files: (1) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_train.*.log&lt;/code&gt; and (2) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_valid.*.log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s an example of what the contents of one of these log files could look like from an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet2&lt;/code&gt; model, using the Unix program &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;log/compute_prob_valid.10.log
&lt;span class=&quot;c&quot;&gt;# nnet-compute-prob exp/nnet2_online/nnet_a/10.mdl ark:exp/nnet2_online/nnet_a/egs/valid_diagnostic.egs &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Started at Sun May  7 18:05:18 UTC 2019&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#&lt;/span&gt;
nnet-compute-prob exp/nnet2_online/nnet_a/10.mdl ark:exp/nnet2_online/nnet_a/egs/valid_diagnostic.egs 
LOG &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;nnet-compute-prob[5.2.110~1-1d137]:main&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;:nnet-compute-prob.cc:91&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Saw 4000 examples, average probability is &lt;span class=&quot;nt&quot;&gt;-4&lt;/span&gt;.32658 and accuracy is 0.275 with total weight 4000
&lt;span class=&quot;nt&quot;&gt;-4&lt;/span&gt;.32658
&lt;span class=&quot;c&quot;&gt;# Accounting: time=6 threads=1&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Ended (code 0) at Sun May  7 18:05:24 UTC 2019, elapsed time 6 seconds&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;These log files contain lines of the form &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accuracy is X&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X&lt;/code&gt; is the frame-classification accuracy, reported as a fraction between 0 and 1. The log file name contains the iteration number in training the neural net (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_train.10.log&lt;/code&gt; contains accuracy on the training data for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10&lt;/code&gt;th iteration).&lt;/p&gt;

&lt;p&gt;There are two important kinds of frame-classification accuracy: (1) the accuracy on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;training&lt;/code&gt; set, and (2) the accuracy on a held-out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validation&lt;/code&gt; set. The filename of the log contains information as to whether the statistics pertain to the training set or the validation set (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_train.*.log&lt;/code&gt; is the accuracy on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;training&lt;/code&gt; set, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_valid.*.log&lt;/code&gt; is the accuracy on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validation&lt;/code&gt; set).&lt;/p&gt;
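
&lt;p&gt;If you just want a quick look at these numbers before plotting anything, a small shell loop is enough. This is only a sketch: it assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet2&lt;/code&gt;-style “accuracy is X” lines shown above and the default log naming.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Print &lt;iteration, accuracy&gt; pairs from the validation logs, sorted by iteration.
for f in log/compute_prob_valid.*.log; do
  iter=$(basename &quot;$f&quot; | grep -o '[0-9]\+')                    # iteration from the filename
  acc=$(grep -o 'accuracy is [0-9.]*' &quot;$f&quot; | awk '{print $3}')  # accuracy from the LOG line
  echo &quot;$iter $acc&quot;
done | sort -n&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;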

&lt;p&gt;These two accuracies give you very important information. Here is an example graph of these two metrics over time, where time is measured in iterations of backpropagation during training. We can see that on the training data, accuracy is reaching over 80%, but on the held-out validation data, the performance is actually getting worse over time, sinking towards 20% accuracy. This model has overfit the training data.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;img src=&quot;/misc/figs/frame-classification-accuracy.png&quot; align=&quot;center&quot; style=&quot;height: 500px&quot; /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;The data used to create this graph come from the relevant log files: (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_train.*.log&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_valid.*.log&lt;/code&gt;). The visualization scripts used to plot the data can be found here (&lt;a href=&quot;https://github.com/JRMeyer/multi-task-kaldi/blob/master/mtk/utils/format_accuracy_for_plot.sh&quot;&gt;format data&lt;/a&gt;) and here (&lt;a href=&quot;https://github.com/JRMeyer/multi-task-kaldi/blob/master/mtk/utils/plot_accuracy.py&quot;&gt;plot data&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;First, run the formatting script. It takes three arguments: (1) the directory in which the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_*.log&lt;/code&gt; files are saved, (2) whether you’re using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet2&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet3&lt;/code&gt;, and (3) the filename of an output file which will then be used for plotting.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./format_accuracy_for_plot.sh &lt;span class=&quot;s2&quot;&gt;&quot;/full/path/to/log/&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;nnet3&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;output_file.txt&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Second, take the newly formatted data and plot it with Python.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;python3 plot_accuracy.py &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; 1 &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;My Awesome Title&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;output_file.txt&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There will be two lines plotted, one for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validation&lt;/code&gt; data and one for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;training&lt;/code&gt; data. The flag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-n&lt;/code&gt; is the number of tasks (I also use the script for multi-task research); here, just set it to one.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;how-to-interpret-frame-classification-accuracy-on-training-data&quot;&gt;How to interpret frame-classification accuracy on training data?&lt;/h3&gt;

&lt;p&gt;Frame-classification accuracy on the training set (i.e. data from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_train.*.log&lt;/code&gt;) tells you how well your net is performing on the data it sees during training. If you make your neural net big enough and run enough training iterations (i.e. epochs), then you will get 100% accuracy on this data. This is a bad thing. When you get really high classification accuracy on the training set, this is typically a sign of overfitting. Your Acoustic Model has stopped “learning” patterns in speech, and started “memorizing” your data. Once you’ve overfit, your model is doing more table look-up than pattern recognition. So, getting 100% frame classification accuracy on your training data is a bad thing.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;what-to-do-about-neural-net-overfitting-on-training-data&quot;&gt;What to do about neural net overfitting on training data?&lt;/h3&gt;

&lt;p&gt;Two simple solutions: (1) Make the net smaller, and (2) don’t run so many epochs. There are many other, more complex solutions, but these two are good first steps.&lt;/p&gt;

&lt;p&gt;When changing the size and architecture of the model, I’d suggest first experimenting with the number of hidden layers, and only afterwards with the number of dimensions in each hidden layer; changing the number of layers tends to produce the larger swings in performance. To get an idea of what the code looks like for these parameters, check out an example of &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/3f95ed9185d8f8f76f9fb71c915119bf8b945a66/egs/wsj/s5/steps/nnet3/train_tdnn.sh#L92&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet3&lt;/code&gt; code here&lt;/a&gt; or &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/3f95ed9185d8f8f76f9fb71c915119bf8b945a66/egs/wsj/s5/steps/nnet2/train_pnorm.sh#L92&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet2&lt;/code&gt; code here&lt;/a&gt;.&lt;/p&gt;
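
&lt;p&gt;As a rough sketch of what those first steps can look like in practice, the call below trains a smaller &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet2&lt;/code&gt; p-norm net for fewer epochs. The flag names follow &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;steps/nnet2/train_pnorm.sh&lt;/code&gt; and the paths are placeholders; double-check both against the option list at the top of the script in your own Kaldi checkout.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Sketch only: fewer epochs, fewer/smaller hidden layers.
# Verify flag names against your version of steps/nnet2/train_pnorm.sh.
steps/nnet2/train_pnorm.sh --num-epochs 8 --num-hidden-layers 3 \
    --pnorm-input-dim 2000 --pnorm-output-dim 200 \
    data/train data/lang exp/tri3b_ali exp/nnet2_smaller&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;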

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;how-to-interpret-frame-classification-accuracy-on-validation-data&quot;&gt;How to interpret frame-classification accuracy on validation data?&lt;/h3&gt;

&lt;p&gt;The second metric from Acoustic Model training is frame-classification accuracy on a held-out validation set. You will find this information in the log files of the type: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_valid.*.log&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Frame-classification accuracy on the held-out validation set is the metric you want to optimize in DNN Acoustic Model training.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Frame-classification accuracy on the held-out validation set represents how well your Acoustic Model is able to classify data that it never saw before. There is no chance that the model “memorized” this data.&lt;/p&gt;

&lt;p&gt;Getting bad accuracy on validation data can mean two things: (1) there’s a problem with your data, and/or (2) there’s a problem with your model. If you’re using off-the-shelf code from Kaldi, you probably don’t have issues in your model. You might have to change the size of the model, but that’s all. More often, the issue is the data. You might have too little data, data that is too noisy, or mis-labeled data. Mis-labeled data could mean that (1) the human transcriber did a bad job writing down all the correct words spoken in the audio, (2) the phonetic dictionary has incorrect pronunciations for words, or (3) the GMM-HMM system did a bad job aligning data to monophones or triphones.&lt;/p&gt;

&lt;p&gt;If your data and labels are wrong, then your neural model won’t be able to learn anything. Think of the case where you’re training a neural net to identify pictures of dogs and cats, but you had a bad annotator who labeled a bunch of dogs as if they were cats. Your system won’t be able to learn what a dog looks like, because the training data was wrong in the first place. In Kaldi (and hybrid DNN-HMM speech recognition in general) we don’t have a human annotator labeling each chunk of audio as belonging to a certain phoneme. Rather, we use a GMM-HMM system to produce those annotations via forced alignment.&lt;/p&gt;

&lt;p&gt;In what follows we’re going to talk about troubleshooting the GMMs systems you use in Kaldi to generate those alignments (i.e. monophones / triphones / LDA + MLLT / SAT).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;troubleshooting-your-gmm-hmms&quot;&gt;Troubleshooting your GMM-HMMs&lt;/h2&gt;

&lt;p&gt;Now that we know why GMMs are so important, let’s find out if they’re working correctly. There are a few important metrics and data points to help you gauge how well your GMMs are performing:&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;table align=&quot;center&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Data Points&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;em&gt;From where?&lt;/em&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Alignments&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;training data&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Transcripts&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;decoded test data&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;decoded test data&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;These three sources of information all tell us how a given GMM model is performing, and it’s important to know where each piece comes from. The alignments, transcripts, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; all are generated as outputs from the GMM-HMM training pipeline. Whether you’re training monophones or triphones with Speaker Adaptive Training (SAT), you will have to go through these same three steps, and as a result you will produce outputs which can be inspected.&lt;/p&gt;

&lt;p&gt;Where these GMM-HMM performance metrics come from:&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;table align=&quot;center&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Step&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;em&gt;Outputs&lt;/em&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Alignment&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;Alignments&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Training&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;GMM-HMMs&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Decoding&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; + Transcripts&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;It’s hard to directly inspect GMM-HMMs, which is why we make use of the outputs of the training and testing phases (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; / transcripts / alignments). The outputs listed above will be produced individually for each model you train, so you can see how the models (e.g. monophones vs. triphones) compare to each other. You can find the code corresponding to each of these three steps in the &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/run.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wsj/s5/run.sh&lt;/code&gt;&lt;/a&gt; file at the following locations.&lt;/p&gt;

&lt;p&gt;I chose to reference this &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/run.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run.sh&lt;/code&gt;&lt;/a&gt; from the Wall Street Journal recipe (i.e. called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wsj&lt;/code&gt; in Kaldi) because all other examples in Kaldi link back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wsj&lt;/code&gt;. This is the root of all Kaldi example recipes (so-called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;egs&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;training-steps-in-wsjs5runsh&quot;&gt;Training Steps in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wsj/s5/run.sh&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;To the left is the line number, in the center is the actual code, and to the right is my comment on what the line refers to.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;97   steps/train_mono.sh &lt;span class=&quot;nt&quot;&gt;--boost-silence&lt;/span&gt; 1.25 &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;                 &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; mono
        data/train_si84_2kshort data/lang_nosp exp/mono0a

116  steps/train_deltas.sh &lt;span class=&quot;nt&quot;&gt;--boost-silence&lt;/span&gt; 1.25 &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; 2000 10000 &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;            &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  tri
        data/train_si84_half data/lang_nosp exp/mono0a_ali exp/tri1

157  steps/train_lda_mllt.sh &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;                               &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  tri + LDA+MLLT
        &lt;span class=&quot;nt&quot;&gt;--splice-opts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;--left-context=3 --right-context=3&quot;&lt;/span&gt; 2500 15000 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
        data/train_si84 data/lang_nosp exp/tri1_ali_si84 exp/tri2b
          
211  steps/train_sat.sh &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; 4200 40000 &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;                   &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  tri + LDA+MLLT + SAT
        data/train_si284 data/lang_nosp exp/tri2b_ali_si284 exp/tri3b&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;alignment-steps-in-wsjs5runsh&quot;&gt;Alignment Steps in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wsj/s5/run.sh&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;N.B. there is no explicit alignment of the tri + LDA+MLLT + SAT model in wsj/s5/run.sh&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;113  steps/align_si.sh &lt;span class=&quot;nt&quot;&gt;--boost-silence&lt;/span&gt; 1.25 &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;                   &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; mono
        data/train_si84_half data/lang_nosp exp/mono0a exp/mono0a_ali
        
154  steps/align_si.sh &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;                                         &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; tri
        data/train_si84 data/lang_nosp exp/tri1 exp/tri1_ali_si84
        
208  steps/align_si.sh &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$train_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;                              &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; tri + LDA+MLLT
        data/train_si284 data/lang_nosp exp/tri2b exp/tri2b_ali_si284&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
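
&lt;p&gt;Once one of these alignment directories exists, you can sanity-check it directly. The sketch below assumes the directory names from the excerpt above; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;show-alignments&lt;/code&gt; prints each utterance’s alignment as human-readable phone sequences, which makes gross alignment failures easy to spot.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Peek at the first few aligned utterances from the triphone alignments.
show-alignments data/lang_nosp/phones.txt exp/tri1_ali_si84/final.mdl \
    &quot;ark:gunzip -c exp/tri1_ali_si84/ali.1.gz |&quot; | head&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;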

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;decoding-steps-in-wsjs5runsh&quot;&gt;Decoding Steps in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wsj/s5/run.sh&lt;/code&gt;&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;103  steps/decode.sh &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; 8 &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$decode_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; exp/mono0a/graph_nosp_tgpr &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;              &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  mono
        data/test_eval92 exp/mono0a/decode_nosp_tgpr_eval92

126  steps/decode.sh &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$nspk&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$decode_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; exp/tri1/graph_nosp_tgpr &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;             &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  tri
        data/test_&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; exp/tri1/decode_nosp_tgpr_&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;

167  steps/decode.sh &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;nspk&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$decode_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; exp/tri2b/graph_nosp_tgpr &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;&amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; tri + LDA+MLLT
        data/test_&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; exp/tri2b/decode_nosp_tgpr_&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
  
230  steps/decode_fmllr.sh &lt;span class=&quot;nt&quot;&gt;--nj&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;nspk&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--cmd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$decode_cmd&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;             &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  tri + LDA+MLLT + SAT
        exp/tri3b/graph_nosp_tgpr data/test_&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\ &lt;/span&gt;  
        exp/tri3b/decode_nosp_tgpr_&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;interpreting-wer&quot;&gt;Interpreting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; is a measure of the combined performance of the Acoustic Model + Language Model, so we have to take it with a grain of salt when we use it to talk about the Acoustic Model alone. If the Acoustic Model is performing badly, but your Language Model is well suited to your data, you may still get a good &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The decoding phase produces a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;, which helps you quickly gauge the performance of your model. For example, you might see something like the following:&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;table align=&quot;center&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Step&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Monophones&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;10%&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Triphones&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;9%&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Triphones + LDA + MLLT&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;7%&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Triphones + LDA + MLLT + SAT&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;82%&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Train Deep Neural Network&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;em&gt;80%&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this case, you know that something went wrong between stage &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;triphones + LDA + MLLT&lt;/code&gt; and stage &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;triphones + LDA + MLLT + SAT&lt;/code&gt;, because all previous models were doing just fine. We don’t have to worry about trying to debug those previous models, because errors only propagate forward, from the stage where they were introduced. In the example shown above, you don’t have to waste time looking at your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;monophones&lt;/code&gt; or vanilla &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;triphones&lt;/code&gt; (i.e. delta+delta triphones in Kaldi-speak), because they couldn’t have been responsible.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; for your models can be found in the following locations within your experiment directory (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exp&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;exp/mono0a/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/wer_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;        &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; monophones
 
exp/tri1/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/wer_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;          &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  triphones

exp/tri2b/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/wer_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;         &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  triphones + LDA + MLLT

exp/tri3b/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/wer_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;         &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  triphones + LDA + MLLT + SAT&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
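
&lt;p&gt;Rather than opening each of these files by hand, you can grep them all and keep only the best score per decode directory. Here’s a sketch (it assumes you’re running from the recipe directory, so that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;utils/best_wer.sh&lt;/code&gt; is available):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# report the best WER for every decode directory under exp/
for d in exp/*/decode_*; do
  [ -d &quot;$d&quot; ] &amp;&amp; grep WER $d/wer_* | utils/best_wer.sh
done&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;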

&lt;p&gt;Here’s an example of how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; is reported:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; exp/tri2b/decode_test_clean/wer_8_0.0
compute-wer &lt;span class=&quot;nt&quot;&gt;--text&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--mode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;present ark:exp/tri2b/decode_test_clean/scoring/test.txt ark,p:- 
%WER 7.82 &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 4109 / 52576, 692 ins, 314 del, 3103 sub &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
%SER 60.04 &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 1573 / 2620 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
Scored 2620 sentences, 0 not present &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;hyp&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As you can see, we get more info than just the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;. We also get the Sentence Error Rate (SER) and importantly, we get info on the size of the test set. Here we see that 2620 sentences were in this test set.&lt;/p&gt;
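
&lt;p&gt;As a quick sanity check, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; is just the number of errors (insertions + deletions + substitutions) divided by the number of reference words. You can reproduce the 7.82 above from the bracketed counts:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;$ awk &apos;BEGIN { printf &quot;%.2f\n&quot;, 100 * (692 + 314 + 3103) / 52576 }&apos;
7.82&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;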

&lt;p&gt;We also get information on how many times the system failed during decoding: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0 not present in hyp&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hyp&lt;/code&gt; stands for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hypothesis&lt;/code&gt;, and if that number is greater than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt;, it means that the model was unable to generate a hypothesis transcript for at least one utterance (i.e. decoding failed). Failed decoding may happen if the decoding beam is too small or if the audio is exceptionally noisy.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;interpreting-alignments&quot;&gt;Interpreting Alignments&lt;/h2&gt;

&lt;p&gt;The alignment logs are found in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exp/*_ali/log/ali.*.gz&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;, the second place we can troubleshoot a model is by looking at the alignments that were used to train that model. If the alignments are bad, then the resulting model will perform poorly.&lt;/p&gt;

&lt;p&gt;As we saw in the &lt;a href=&quot;#big-picture&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Big Picture&lt;/code&gt;&lt;/a&gt;, at each step information from the previous model (e.g. monophones) is passed to the next model (e.g. triphones) via alignments. As such, it’s important that you’re sure that at each step the information passed downstream is accurate. Before we jump into the alignment files, let’s talk a little about what alignments are in the first place.&lt;/p&gt;

&lt;p&gt;You can think of alignments as audio with time-stamps, where the time-stamps correspond to the sounds that were spoken in that audio (i.e. phonemes – or whatever your modeling units are, which may be senones, graphemes, or just characters).&lt;/p&gt;

&lt;p&gt;As you can see in the image below, there is some audio utterance (shown on top via waveform), and we have some text utterance associated with that audio, and after alignment we end up with the bottom row (alignments of individual phonemes to slices of audio).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;img src=&quot;/misc/figs/monophone-alignments.png&quot; align=&quot;center&quot; style=&quot;height: 500px&quot; /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;This alignment corresponds to a monophone model, because each phoneme (i.e. linguistic speech sound) is modeled by only one unit. For example, you can see that there are multiple instances of the sound &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;, and they are all coded exactly the same (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt;). Each instance of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; is actually acoustically different from the others, because each instance has its own neighboring sounds which affect it: this is called the co-articulation effect.&lt;/p&gt;

&lt;p&gt;Triphone models take this co-articulation effect into account by modeling phonemes based on context. The triphone alignment would look something more like this:&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;img src=&quot;/misc/figs/triphone-alignments.png&quot; align=&quot;center&quot; style=&quot;height: 500px&quot; /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;The sounds are now encoded with their surrounding context. For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IH-Z-AH&lt;/code&gt; represents the sound &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Z&lt;/code&gt; which is preceded by the sound &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IH&lt;/code&gt; and followed by the sound &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AH&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Triphone alignments are more complicated than monophone alignments. This extra complication allows for the modeling of more nuanced phonetic detail. In the monophone alignments we see that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; from “this” and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; from “is” are modeled as the same sound. However, in reality these two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; sounds are acoustically very different, because they have different neighboring phonemes. The monophone alignments lose this distinction, but the triphone alignments capture it, modeling the two as separate units.&lt;/p&gt;

&lt;p&gt;Triphone models take into account (a) the central phoneme of interest, (b) the phoneme to the immediate left, and (c) the phoneme to the immediate right. As such, triphones take into account three phonemes (this is where the word “triphone” comes from).&lt;/p&gt;

&lt;p&gt;Each step in the training pipeline (monophone, triphone, triphone + adaptation) generates its own set of alignments, and these alignments can be found in the following locations:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;exp/mono0a_ali/ali.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.gz          &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  monophones

exp/tri1_ali/ali.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.gz            &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  triphones

exp/tri2b_ali/ali.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.gz           &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  triphones + LDA + MLLT

exp/tri3b_ali/ali.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.gz           &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;  triphones + LDA + MLLT + SAT&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
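
&lt;p&gt;A quick sanity check on any of these directories is to make sure that every training utterance actually received an alignment. Here’s a sketch for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tri1&lt;/code&gt; alignments (it assumes your training transcripts live in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/train/text&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# number of utterances in the training transcripts
$ wc -l &lt; data/train/text

# number of utterances that received an alignment
$ copy-int-vector &quot;ark:gunzip -c exp/tri1_ali/ali.*.gz |&quot; ark,t:- | wc -l&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;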

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;direct-inspection-of-alignments&quot;&gt;Direct inspection of alignments&lt;/h3&gt;

&lt;p&gt;We can directly inspect these alignments with Kaldi tools.&lt;/p&gt;

&lt;p&gt;What we would like to see from these tools are the following: (1) all the words spoken in the utterance are found in the alignment, and (2) the words and phonemes are appropriately spaced in time. For both of these, the file &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/get_train_ctm.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_train_ctm.sh&lt;/code&gt;&lt;/a&gt; will be particularly helpful because it shows the words in the utterance along with their time-stamps.&lt;/p&gt;

&lt;p&gt;You should be able to see if the transcript contains missing words, extra words, or if the timing is off. I’d suggest you look at an utterance from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ctm&lt;/code&gt; file and listen to the audio at the same time. This way you can compare what you hear to what the system is producing.&lt;/p&gt;

&lt;p&gt;Here’s how you can pull out the word-level alignments from one file (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ID_354.wav&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# create the ctm file&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;steps/get_train_ctm.sh data/train/ data/lang exp/tri_ali/

&lt;span class=&quot;c&quot;&gt;# pull out one utt from ctm file&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;ID_354 exp/tri_ali/ctm 
ID_354 1 0.270 0.450 the 
ID_354 1 0.720 0.420 dog 
ID_354 1 1.140 0.330 ran 
ID_354 1 1.470 0.510 home&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;These word-level alignments look good, but we can even get a finer-grained look at alignments by inspecting the phoneme-level alignments with &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/master/src/bin/show-alignments.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;show-alignments.cc&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;show-alignments data/lang/phones.txt exp/tri1_ali/final.mdl ark:&lt;span class=&quot;s2&quot;&gt;&quot;gunzip -c ali.1.gz |&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ID_354&quot;&lt;/span&gt;
ID_354 &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 4 1 1 1 16 18 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 21958 21957 22115 22115 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 39928 40076 40075 40296 40295 40295 40295 40295 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 44894 44893 45211 45211 &lt;span class=&quot;o&quot;&gt;][&lt;/span&gt; 4 16 18 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 28226 28268 28267 28267 28267 28267 28267 28267 28314 28313 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 34762 34846 34914 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 63728 63918 63958 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 4 1 18 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
ID_354  SIL       DH      AH1  D              AA    G        R      AE      N HH               OW         M                 SIL &lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I just made up the alignment above to fit nicely in this guide, but the structure is identical to what &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/master/src/bin/show-alignments.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;show-alignments.cc&lt;/code&gt;&lt;/a&gt; will produce. There are two lines produced, which both begin with the utterance ID of the audio file we’re interested in (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ID_354&lt;/code&gt;). The first line has a bunch of numbers and square brackets separated by spaces, where the individual numbers correspond to sub-parts of phonemes (i.e. individual states of the HMM-GMM), and the brackets show you where the phonemes themselves start and end. You should find two square brackets (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;]&lt;/code&gt;) for each phoneme in the utterance. A number is repeated as many times as the model passed through that HMM state. Phonemes which take up more time in the audio itself will have more numbers within their brackets.&lt;/p&gt;
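
&lt;p&gt;If you’d rather see explicit time-stamps than raw HMM state IDs, you can also dump a phoneme-level &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ctm&lt;/code&gt; straight from the alignments. Here’s a sketch using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ali-to-phones&lt;/code&gt; (the paths assume the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tri1&lt;/code&gt; alignment directory as above):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# phoneme-level ctm: utt-id, channel, start, duration, phone
$ ali-to-phones --ctm-output exp/tri1_ali/final.mdl \
    &quot;ark:gunzip -c exp/tri1_ali/ali.1.gz |&quot; - | \
    utils/int2sym.pl -f 5 data/lang/phones.txt | head&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;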

&lt;p&gt;As you can see in the output from &lt;a href=&quot;https://github.com/kaldi-asr/kaldi/blob/master/src/bin/show-alignments.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;show-alignments.cc&lt;/code&gt;&lt;/a&gt;, silence (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIL&lt;/code&gt;) is made explicit at the beginning and end of each utterance. The alignment procedure assumes that there is some silence at the beginning and end of the audio file. If you don’t have silence in the audio, your alignment procedure will still proceed as if there really is silence in the audio. As such, your GMM-HMM model will “find” silence in the audio, even if it isn’t there, and estimate the acoustic properties of silence based on speech instead – this is bad. You should try to have silence at the beginning and end of your audio files.&lt;/p&gt;
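
&lt;p&gt;If your recordings are clipped right at the speech, one low-tech fix is to pad each file with a little silence before you extract features. Here’s a sketch using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sox&lt;/code&gt; (the directory names are made up, and half a second is just a reasonable default):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# add 0.5 seconds of silence to the start and end of every wav
$ mkdir -p wav-padded
$ for f in wav/*.wav; do sox &quot;$f&quot; &quot;wav-padded/$(basename $f)&quot; pad 0.5 0.5; done&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;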

&lt;p&gt;A very nice thing about inspecting alignments is that they are independent of the Language Model. That is, only the phonetic dictionary and the training transcripts are used to generate the alignments.&lt;/p&gt;

&lt;p&gt;You can get an idea of which words are causing alignment issues by looking into the log files from alignment (i.e. the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;acc.*.log&lt;/code&gt; files, which accumulate stats from alignments). When an alignment fails, you will find “No alignment for utterance” followed by the utterance ID in these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;acc.*.log&lt;/code&gt; files. To find all such issues in your triphones, simply grep as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;No alignment&quot;&lt;/span&gt; exp/tri&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/log/acc.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.log&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This usually means either there’s a mis-match between the audio and the transcript, or the audio is really noisy. You can also check the per-pass alignment logs for errors reported by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gmm-align-compiled&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;gmm-align-compiled.* errors on &quot;&lt;/span&gt; exp/tri&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;_ali/log/align_pass&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.log&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For statistics on the phonemes you trained and where they’re located, take a look at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;analyze_alignments.log&lt;/code&gt; files which you find in your alignment directories (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exp/tri*_ali/log/analyze_alignments.log&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;interpreting-the-decoding-transcripts-from-test-data&quot;&gt;Interpreting the Decoding Transcripts from Test Data&lt;/h3&gt;

&lt;p&gt;Another place that we can evaluate the performance of a GMM-HMM model is by inspecting the transcripts that it produced at decoding time on the test data. These results reflect the Acoustic Model, the Language Model, and the decoding-time parameters.&lt;/p&gt;

&lt;p&gt;You can find the decoding output in the following log files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;exp/mono0a/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/scoring/log/best_path.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.log &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; monophones
 
exp/tri1/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/scoring/log/best_path.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.log   &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; triphones

exp/tri2b/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/scoring/log/best_path.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.log  &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; triphones + LDA + MLLT

exp/tri3b/decode_&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/scoring/log/best_path.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.log  &amp;lt;&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; triphones + LDA + MLLT + SAT&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If we take a look into one of those log files, we will see something like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-25&lt;/span&gt; tri2a/decode_test/scoring/log/best_path.10.0.0.log | &lt;span class=&quot;nb&quot;&gt;tail
&lt;/span&gt;ID-0004 NUMBERED DEN A FRESH NELLIE IS WAITING ON YOU GOOD NIGHT HUSBAND 
LOG &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;lattice-best-path[5.2.110~1-1d137]:main&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;:lattice-best-path.cc:99&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; For utterance ID-0005, best cost 204.174 + 3882.34 &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 4086.51 over 962 frames.
ID-0005 THE MUSIC CAME NEARER AND HE RECALLED THE WORDS THE WORDS OF SHELLEY&lt;span class=&quot;s1&quot;&gt;&apos;S FRAGMENT UPON THE MOON WANDERING COMPANIONLESS PALE FOR WEARINESS 
LOG (lattice-best-path[5.2.110~1-1d137]:main():lattice-best-path.cc:99) For utterance ID-0006, best cost 235.088 + 4450.84 = 4685.93 over 1054 frames.
ID-0006 THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE WHERE ON ANOTHER EQUATION BEGAN TO UNFOLD ITSELF SLOWLY AND TO SPREAD ABROAD ITS WIDENING TALE 
LOG (lattice-best-path[5.2.110~1-1d137]:main():lattice-best-path.cc:99) For utterance ID-0007, best cost 90.7232 + 1771.01 = 1861.73 over 426 frames.
ID-0007 A COLD LUCID IN DIFFERENCE OR REINED IN HIS SOUL 
LOG (lattice-best-path[5.2.110~1-1d137]:main():lattice-best-path.cc:99) For utterance ID-0008, best cost 136.785 + 2868.84 = 3005.63 over 671 frames.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;best_path&lt;/code&gt; in the filename &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;best_path.10.0.0.log&lt;/code&gt; means that the output represents the 1-best path through the decoding lattice for each utterance. This is very useful, because this shows you how the model is performing at test-time, and you can spot errors and biases here more easily.&lt;/p&gt;

&lt;p&gt;The first few lines are logging data, and the lines in all caps are the model’s prediction on some testing data. These lines show (1) the utterance ID  (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ID-0007&lt;/code&gt;), and (2) the prediction of the model for that utterance (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A COLD LUCID IN DIFFERENCE OR REINED IN HIS SOUL&lt;/code&gt;). It is good to listen to the audio file and look at the corresponding output to identify errors.&lt;/p&gt;

&lt;p&gt;Sometimes you can even see if the errors stem from the Acoustic Model or from the Language Model (e.g. the model predicted &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN DIFFERENCE&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INDIFFERENCE&lt;/code&gt;, which is a Language Model problem given that both options are acoustically identical).&lt;/p&gt;
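
&lt;p&gt;For a single suspicious utterance, it helps to put the hypothesis right next to the reference transcript. Here’s a sketch (it uses the same decoding log as the example above and assumes your test transcripts live in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/test/text&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# hypothesis, from the decoding log
$ grep &quot;^ID-0007&quot; tri2a/decode_test/scoring/log/best_path.10.0.0.log

# reference, from the test transcripts
$ grep &quot;^ID-0007&quot; data/test/text&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;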

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;what-next-for-acoustic-model-troubleshooting&quot;&gt;What next for Acoustic Model troubleshooting?&lt;/h3&gt;

&lt;p&gt;If at this point you’ve run all the inspections above, your GMM-HMM model should perform well and your alignments should look good. What should you do if you’re still having issues training a good neural net?&lt;/p&gt;

&lt;p&gt;Well, as mentioned above, if you’re overfitting your training data, then try to reduce the size of the model as well as the number of epochs you run. At this point you might need to do some hyper-parameter searching within the suggestions I provide below as a cheat-sheet. Try to identify consistent problems in the output of the combined model (Language Model + Acoustic Model).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;language-model&quot;&gt;Language Model&lt;/h2&gt;

&lt;p&gt;The Language Model (LM) is indeed very important for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt;. The LM encodes (a) which words are possible, (b) which words are impossible, and (c) how probable each possible word is. A word may only be decoded if it occurs in the vocabulary of the Language Model. Any word that does not occur in the vocabulary is an Out-Of-Vocabulary (OOV) word. The Language Model contains a probability for each word and word sequence.&lt;/p&gt;
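
&lt;p&gt;A quick way to gauge the OOV problem is to compare the words in your test transcripts against the decoding vocabulary. Here’s a sketch (it assumes the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/lang/words.txt&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/test/text&lt;/code&gt; layout):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# unique words in the test transcripts (drop the utterance IDs in column 1)
$ cut -d&apos; &apos; -f2- data/test/text | tr &apos; &apos; &apos;\n&apos; | sort -u &gt; test_words.txt

# test words missing from the decoding vocabulary, i.e. guaranteed errors
$ awk &apos;NR==FNR { vocab[$1] = 1; next } !($1 in vocab)&apos; data/lang/words.txt test_words.txt | head&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;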

&lt;p&gt;The LM you use in your application and the LM you use at test time do not need to be the same. However, if you’re optimizing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; at test-time by adjusting Language Model parameters, it isn’t guaranteed that improvements will transfer to inference with a new LM.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;notes-on-the-language-model-you-use-at-test-time&quot;&gt;Notes on the Language Model you use at Test-Time&lt;/h3&gt;

&lt;p&gt;Improvements in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WER&lt;/code&gt; which come from Acoustic Model parameter changes should generalize across Language Models. For example, if you find that for a bi-gram Language Model a 5-layer neural net Acoustic Model works better than a 6-layer net, you should expect that (for your data), a 5-layer Acoustic Model will beat out a 6-layer net, regardless of the n-gram order of the Language Model.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In short, when you’re troubleshooting your Kaldi system, first look at the training statistics for your  Deep Neural Network Acoustic Model. Then, look at the training, alignment, and decoding stats of your GMM-HMMs. If all those data look good, then try training a Language Model more suited for your use-case.&lt;/p&gt;

&lt;p&gt;If you are still running into issues, look on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt; to see if someone else has had your problem (often they have). As far as I know, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt; is the only forum for these kinds of questions. Unfortunately, the atmosphere there can be unwelcoming to newcomers. If you post a question and get a critical response, don’t let it upset you: it has nothing to do with you or your ability to do good ASR! I was afraid to post on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt; for a long time because of this atmosphere, and indeed my first post was not received kindly. Alternatively, you can post questions here on my blog, though the community here is not as large as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I hope this was helpful, and happy Kaldi-ing!&lt;/p&gt;

&lt;p&gt;If you have comments or suggestions, you can always leave a comment below.&lt;/p&gt;

</description>
        <pubDate>Sat, 17 Aug 2019 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/asr/2019/08/17/Kaldi-troubleshooting.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/asr/2019/08/17/Kaldi-troubleshooting.html</guid>
        
        
        <category>ASR</category>
        
      </item>
    
      <item>
        <title>Kaldi Hyperparameter Cheatsheet</title>
        <description>&lt;p&gt;&lt;img src=&quot;/misc/kaldi_text_and_logo.png&quot; align=&quot;right&quot; style=&quot;width: 300px;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The following is a cheatsheet for common hyperparameters in Kaldi.&lt;/p&gt;

&lt;p&gt;If you’re looking to get started with Kaldi, here’s an &lt;a href=&quot;http://jrmeyer.github.io/kaldi/2016/01/26/Installing-Kaldi.html&quot;&gt;installation guide&lt;/a&gt; and &lt;a href=&quot;http://jrmeyer.github.io/kaldi/2016/12/15/DNN-AM-Kaldi.html&quot;&gt;DNN training guide&lt;/a&gt;. If you’d like a simple, easy to understand Kaldi recipe, you can check out the &lt;a href=&quot;https://github.com/JRMeyer/easy-kaldi&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;easy-kaldi&lt;/code&gt; GitHub repo&lt;/a&gt;. You probably won’t get state-of-the-art results with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;easy-kaldi&lt;/code&gt;, but you will hopefully be able to understand the pipeline.&lt;/p&gt;

&lt;p&gt;If you’re looking for more in-depth troubleshooting, check out &lt;a href=&quot;http://jrmeyer.github.io/asr/2019/08/17/Kaldi-troubleshooting.html&quot;&gt;Kaldi Troubleshooting from Head-to-Toe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;hyperparameter-cheatsheet&quot;&gt;Hyperparameter Cheatsheet&lt;/h2&gt;

&lt;p&gt;The following parameter ranges are what I would recommend as a good starting place. However, what works for your data and your application may differ.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;gmm-hmm-alignment&quot;&gt;GMM-HMM Alignment&lt;/h3&gt;

&lt;p&gt;Each of these following steps depends on the previous step. If you have bad monophone alignments, you will have bad triphone alignments. If you have bad triphone alignments, then you will train a bad neural net. As such, you should take some time to tweak parameters on each stage, to make sure your model and alignments are good to pass on to the next stage.&lt;/p&gt;

&lt;p&gt;The parameters listed here have two values associated with them, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N → M&lt;/code&gt;. Good model parameters for your data should be somewhere in between the extremes of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M&lt;/code&gt;, so I’d advise some binary search to find good settings for you. Optimize for number of training iterations only after you’ve gotten good numbers for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;numleaves&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;totgauss&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# Monophones (steps/train_mono.sh)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;boost_silence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.25
&lt;span class=&quot;nv&quot;&gt;num_iters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 20 → 40
&lt;span class=&quot;nv&quot;&gt;totgauss&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1000 → 2000

&lt;span class=&quot;c&quot;&gt;# Triphones (steps/train_deltas.sh)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;boost_silence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1.25
&lt;span class=&quot;nv&quot;&gt;num_iters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 20 → 40
&lt;span class=&quot;nv&quot;&gt;numleaves&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 2000 → 5000
&lt;span class=&quot;nv&quot;&gt;totgauss&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 10000 → 50000

&lt;span class=&quot;c&quot;&gt;# Triphones + LDA + MLLT (steps/train_lda_mllt.sh)&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--left-context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 2 → 10
&lt;span class=&quot;nt&quot;&gt;--right-context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 2 → 10
&lt;span class=&quot;nv&quot;&gt;num_iters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 20 → 40
&lt;span class=&quot;nv&quot;&gt;numleaves&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 2500 → 7500
&lt;span class=&quot;nv&quot;&gt;totgauss&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 15000 → 75000

&lt;span class=&quot;c&quot;&gt;# Triphones + LDA + MLLT + SAT (steps/train_sat.sh)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;num_iters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 20 → 40
&lt;span class=&quot;nv&quot;&gt;numleaves&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 2500 → 10000
&lt;span class=&quot;nv&quot;&gt;totgauss&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 15000 → 200000&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
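
&lt;p&gt;To make this concrete, here’s a sketch of where two of those numbers actually go in a triphone training call (mid-range values; the paths assume the usual egs layout with monophone alignments already in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exp/mono_ali&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# the positional arguments are numleaves and totgauss
$ steps/train_deltas.sh --cmd &quot;$train_cmd&quot; --num-iters 30 --boost-silence 1.25 \
    3500 30000 data/train data/lang exp/mono_ali exp/tri1&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;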

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;dnn-training&quot;&gt;DNN Training&lt;/h3&gt;

&lt;p&gt;You should ideally be using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet3&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet2&lt;/code&gt;. At this point, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nnet3&lt;/code&gt; is more tried and tested, and will have better support moving forward.&lt;/p&gt;

&lt;p&gt;Long, skinny nets are better than short, fat ones. Monitor your training progress with information from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_train.*.log&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compute_prob_valid.*.log&lt;/code&gt;.&lt;/p&gt;
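
&lt;p&gt;For example, a quick way to eyeball overfitting is to compare the train and valid objectives side by side. Here’s a sketch (the experiment directory is made up, and the exact wording of the log lines varies between Kaldi versions):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;$ grep -h &quot;Overall&quot; exp/nnet3/tdnn1a/log/compute_prob_train.*.log | tail -n 3
$ grep -h &quot;Overall&quot; exp/nnet3/tdnn1a/log/compute_prob_valid.*.log | tail -n 3&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;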

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;Number of epochs: 1 → 20
Number of Hidden Layers: 5 → 15
Dimension of Hidden Layers: 512 → 1280
Kind of Neural Network: TDNN or LSTM
Kind of non-linearity: ReLU or tanh&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;decoding&quot;&gt;Decoding&lt;/h3&gt;

&lt;p&gt;There’s a speed / accuracy trade-off at decoding time. You can decode faster by considering fewer possible words / phrases, but if you don’t consider the correct word, then you’ve missed it.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;max_active&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 2000 → 7000
&lt;span class=&quot;nv&quot;&gt;min_active&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 50 → 200
&lt;span class=&quot;nv&quot;&gt;beam&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 5.0 → 15.0
&lt;span class=&quot;nv&quot;&gt;lattice_beam&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1.0 → 8.0&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
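
&lt;p&gt;Here’s a sketch of where those knobs go in a decoding call (the graph, data, and output directory names are just illustrative):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;$ steps/decode.sh --nj 4 --cmd &quot;$decode_cmd&quot; \
    --max-active 5000 --beam 11.0 --lattice-beam 5.0 \
    exp/tri2b/graph data/test exp/tri2b/decode_test&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;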

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;If you are still running into issues, look on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt; to see if someone else has had your problem (often they have). As far as I know, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt; is the only forum for these kinds of questions. Unfortunately, the atmosphere there can be unwelcoming to newcomers. If you post a question and get a critical response, don’t let it upset you: it has nothing to do with you or your ability to do good ASR! I was afraid to post on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt; for a long time because of this atmosphere, and indeed my first post was not received kindly. Alternatively, you can post questions here on my blog, though the community here is not as large as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kaldi-help&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I hope this was helpful, and happy Kaldi-ing!&lt;/p&gt;

&lt;p&gt;If you have comments or suggestions, you can always leave a comment below.&lt;/p&gt;

</description>
        <pubDate>Sat, 17 Aug 2019 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/asr/2019/08/17/Kaldi-cheatsheet.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/asr/2019/08/17/Kaldi-cheatsheet.html</guid>
        
        
        <category>ASR</category>
        
      </item>
    
      <item>
        <title>How we added Kyrgyz to Mozilla&apos;s Common Voice project</title>
        <description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/misc/robot-greetings.png&quot; align=&quot;right&quot; style=&quot;width: 300px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Mozilla is helping build Kyrgyz voice technology (for free) via its Common Voice project. Anyone can go to the &lt;a href=&quot;https://voice.mozilla.org/en&quot;&gt;Common Voice website&lt;/a&gt; and record sentences for the project. We need as many speakers and accents as possible in order to create robust technologies. &lt;a href=&quot;https://voice.mozilla.org/ky&quot;&gt;Donate your voice now.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;what-is-common-voice&quot;&gt;What is Common Voice?&lt;/h2&gt;

&lt;p&gt;Common Voice is a data collection project from Mozilla, focused on collecting free and open-source data for speech recognition systems. To build a working speech recognition system (such as Siri or OK Google) the developer must first train the computer to understand how words in a language are pronounced. The system must be able to distinguish sounds like vowels and consonants, and to accomplish this you need lots of audio data.&lt;/p&gt;

&lt;p&gt;Mozilla is crowd-sourcing this data collection with the Common Voice project, which allows anyone to record their voice and upload it to the cloud. At any time, developers can download these collections of recordings and train their own speech technologies for any kind of application. For instance, Google could use this data to create a Kyrgyz-language voice assistant for Android phones, or Namba taxi could use this data to make a voice-powered iPhone app for Kyrgyz speakers. Anyone can download the data and use it for whatever project they wish!&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;why-is-this-important&quot;&gt;Why is this important?&lt;/h2&gt;

&lt;p&gt;Adding Kyrgyz data collection to Mozilla’s Common Voice project is important because it will spur innovation in the Kyrgyz technology sector. Voice technologies are often more comfortable to use than typing, and for many people (such as people who are blind or handicapped) these technologies are essential to normal living. Voice technologies are widely used already for European languages such as English, French and Spanish because the appropriate datasets (i.e. large collections of voice recordings) already exist.&lt;/p&gt;

&lt;p&gt;However, the time and money required to create one of these datasets is a major hindrance for a new language, and Mozilla has developed a crowd-sourcing approach which makes data collection much faster and free. Mozilla provides all the recordings under the Creative Commons CC-0 license, so the data is free to use for any purpose. This open license ensures that small companies have access to the same cutting-edge technologies as technology giants like Google.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-did-we-do-it&quot;&gt;How did we do it?&lt;/h2&gt;

&lt;p&gt;We took two main steps to add Kyrgyz to the Common Voice project:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Translate the user interface and information into Kyrgyz&lt;/li&gt;
  &lt;li&gt;Collect text sentences which users will read out-loud&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Common Voice team itself, led by &lt;a href=&quot;https://video.golem.de/internet/20162/mozilla-common-voice-interview-englisch.html&quot;&gt;Michael Henretty&lt;/a&gt; and &lt;a href=&quot;https://video.golem.de/internet/20161/mozilla-deep-speech-interview-englisch.html&quot;&gt;Kelly Davis&lt;/a&gt;, pushed the Kyrgyz language version into production. In addition, &lt;a href=&quot;https://www.hse.ru/en/news/campus/208242212.html&quot;&gt;Francis Tyers&lt;/a&gt; (computational linguist and language activist) aided in team coordination and translation oversight. &lt;a href=&quot;http://jrmeyer.github.io/about/&quot;&gt;Josh Meyer&lt;/a&gt; processed the text from Kloop.kg and helped coordinate the various teams.&lt;/p&gt;

&lt;h3 id=&quot;translation&quot;&gt;Translation&lt;/h3&gt;

&lt;p&gt;In order to translate the user interface into Kyrgyz, a team of contributors worked together to ensure that the translations are natural-sounding and accurate. This team includes &lt;a href=&quot;https://www.facebook.com/chorobek.saadanbekov&quot;&gt;Chorobek Saandanbek&lt;/a&gt; (director of the &lt;a href=&quot;http://bizdin.kg/&quot;&gt;Bizdin Muras Foundation&lt;/a&gt;), &lt;a href=&quot;https://www.facebook.com/saykal.maatkerimova&quot;&gt;Saikal Maatkerimova&lt;/a&gt; (Kyrgyz language instructor at &lt;a href=&quot;https://www.facebook.com/lingua.yurt&quot;&gt;Lingua Yurt&lt;/a&gt;), and &lt;a href=&quot;https://www.facebook.com/subanaliev&quot;&gt;Talgat Subanaliev&lt;/a&gt; (recent &lt;a href=&quot;https://www.auca.kg/&quot;&gt;AUCA&lt;/a&gt; graduate). As the interface evolves and expands, Saandanbek will lead the team to ensure that the translation is accurate and up-to-date.&lt;/p&gt;

&lt;p&gt;Mozilla has enabled crowd-sourcing of the interface translation itself, such that anyone can propose a new translation if they identify an error in translation. Kyrgyz speakers can help with Common Voice translation via the &lt;a href=&quot;https://pontoon.mozilla.org/ky/common-voice/&quot;&gt;Pontoon system&lt;/a&gt;. Once a new translation is proposed, the team leader (i.e. Saandanbek) will review and accept or defer the translation. In this way, problems are found quickly and resolved appropriately.&lt;/p&gt;

&lt;h3 id=&quot;text-collection&quot;&gt;Text Collection&lt;/h3&gt;

&lt;p&gt;To create a dataset for training speech technologies, collecting voices isn’t enough. We need to know what was said in every recording so that the computer can recognize words from the audio. As such, Mozilla has devised a system to display a text sentence on the screen, and then the speaker reads the sentence out loud so that each recording is saved along with the text. These sentences are difficult to find, because they must be under the &lt;a href=&quot;https://creativecommons.org/publicdomain/zero/1.0/deed&quot;&gt;Creative Commons license CC-0&lt;/a&gt; so that Mozilla may freely distribute the text sentences and audio recordings together.&lt;/p&gt;

&lt;p&gt;Currently, all Kyrgyz text sentences used for this project come from the well-known Kyrgyz language news source &lt;a href=&quot;http://ky.kloop.asia/&quot;&gt;Kloop.kg&lt;/a&gt;. The founder of Kloop.kg, &lt;a href=&quot;https://twitter.com/bektour&quot;&gt;Bektour Iskender&lt;/a&gt; - a proponent of an open internet and the Creative Commons - allowed Kyrgyz language articles from Kloop to be distributed under CC-0. As such, when the user reads a sentence for Kyrgyz Common Voice, they are actually reading news from Kloop.kg. This is a major win for the Kyrgyz language and the open internet, because finding CC-0 text for Common Voice is typically the most difficult task in adding a new language. At least 5,000 different sentences should be initially recorded, and most books and online news (such as &lt;a href=&quot;https://www.bbc.com/kyrgyz&quot;&gt;BBC Kyrgyz&lt;/a&gt;) are not available under CC-0.&lt;/p&gt;

&lt;p&gt;After the text was automatically downloaded from Kloop (via &lt;a href=&quot;https://www.github.com/JRMeyer/web_corpus&quot;&gt;this Python script&lt;/a&gt;), the text was cleaned (all foreign words, numbers, abbreviations were removed) and sentences of an appropriate length were selected. Ideally each recording should be about 5 seconds long. More text can be added later, such that there is more diversity in the kinds of sentences read. Diversity is important for Common Voice, because good speech technologies should recognize the speech of people speaking with different accents about different topics.&lt;/p&gt;
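
&lt;p&gt;As a rough illustration of that length filter (the file names here are made up, and the word-count range is just a heuristic that tends to give recordings of around 5 seconds):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# keep sentences of 3 to 14 words that contain no digits
$ awk &apos;NF &gt;= 3 &amp;&amp; NF &lt;= 14 &amp;&amp; $0 !~ /[0-9]/&apos; kloop_sentences.txt &gt; selected_sentences.txt&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;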

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;donate-your-voice&quot;&gt;Donate Your Voice!&lt;/h2&gt;

&lt;p&gt;In order for quality technologies to be created for the Kyrgyz language, we need more voices!&lt;/p&gt;

&lt;p&gt;Anyone can record and donate sentences for the Common Voice project, and the more voices we get, the more accurate the technology becomes.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://voice.mozilla.org/ky&quot;&gt;Donate your voice today!&lt;/a&gt;&lt;/p&gt;

</description>
        <pubDate>Wed, 29 May 2019 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/misc/2019/05/29/mozilla-kyrgyz-common-voice.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/misc/2019/05/29/mozilla-kyrgyz-common-voice.html</guid>
        
        
        <category>misc</category>
        
      </item>
    
      <item>
        <title>How to Train &lt;small&gt;&lt;i&gt;practically&lt;/i&gt;&lt;/small&gt; any Model from  &lt;small&gt;&lt;i&gt;practically&lt;/i&gt;&lt;/small&gt; any Data with TensorFlow</title>
        <description>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/misc/tf-logo.png&quot; align=&quot;right&quot; alt=&quot;logo&quot; style=&quot;width: 225px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;objectives&quot;&gt;Objectives&lt;/h2&gt;

&lt;p&gt;This post will guide you on how to take your data (in a CSV file) to a trained TensorFlow model of your choosing.&lt;/p&gt;

&lt;p&gt;You’re not going to find any tricks or hacks here. The title of this blog post is so general because the TensorFlow developers have created a great API for &lt;a href=&quot;https://www.youtube.com/watch?v=uIcqeP7MFH0&quot;&gt;importing data&lt;/a&gt; and &lt;a href=&quot;https://www.youtube.com/watch?v=G7oolm0jU8I&quot;&gt;training standard models&lt;/a&gt;. If you follow all the suggestions of the official TensorFlow docs, you should come to the same conclusions I do here.&lt;/p&gt;

&lt;p&gt;It may be tempting to quickly write a script that works for your current data and current task, but if you take a little extra time and write generalizable code, you will save yourself headaches in the future. The instructions here will help you easily scale to different datasets and different model architectures.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;pre-requisites&quot;&gt;Pre-requisites&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;A working, new version of TensorFlow installed.&lt;/li&gt;
  &lt;li&gt;Your data in CSV format. The reason I chose CSV data as the starting point is that almost any data can be formatted as a CSV file. Getting your raw data to a CSV file is on you, but once you get there, the rest is smooth sailing :) From CSV data, I show you how to get your data into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; format, which is the preferred TF data format. So, if your data is already in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt;, you’re already ahead of the curve!&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;install-tensorflow&quot;&gt;Install TensorFlow&lt;/h3&gt;
&lt;p&gt;Just follow the official &lt;a href=&quot;https://www.tensorflow.org/install/&quot;&gt;installation instructions&lt;/a&gt;!&lt;/p&gt;

&lt;h3 id=&quot;get-data-in-csv&quot;&gt;Get Data in CSV&lt;/h3&gt;

&lt;p&gt;To ground this post in a concrete example, below is my own labeled data in CSV. Each training data example is represented as a single row in the CSV file, where the first column represents the label (an integer), and all the following columns contain the features for that example (floating point numbers).&lt;/p&gt;

&lt;p&gt;The labels in the first column represent categories of speech sounds, for example, label &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;45&lt;/code&gt; might be the vowel &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[oh]&lt;/code&gt; and label &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;7&lt;/code&gt; might be the consonant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[k]&lt;/code&gt;. The features shown in column-two onward correspond to amplitudes at different frequency ranges. In a nutshell, this is speech data, where a snippet of audio (features) has been labeled with a language sound (labels).&lt;/p&gt;

&lt;p&gt;Here’s what four lines of my data CSV file look like (where the delimiter is a single space):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;95 21.50192 -2.16182 -1.591426 0.06965831 0.6690025  ...  -0.7368361 -1.385849 0.7551874 -0.8878949 -0.4799456 
7 22.23698 -1.177924 -1.368747 -0.6289141 0.009075502  ... -0.9235415 -1.74792 0.2629939 -2.119366 -0.539937
45 22.83421 -0.9043457 -1.591426 -0.816999 -0.3035215  ...  -0.5301266 -1.456303 -0.1479924 -1.641482 -0.04098308 
27 -0.9376022 -0.05841255 0.3308391 -0.7141842 -0.3867566  ...  -1.263647 23.4316 -0.0009118451 -1.035212 -1.635385
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pretty simple, right? One training example is one line in the CSV file.&lt;/p&gt;

&lt;p&gt;If your data isn’t in this kind of CSV format, you’re going to have to spend a little time to get it here. The most important point is that you need one training example per line, and you should know exactly where each part of the example is located. For my example, I know that the label is the first column, and all the following columns are my features. You also must know how each label/feature is represented. For my case, all the labels are integers, and all the features are floating point numbers. You might have text-based labels or features (e.g. words from text), or you could have categorical features (e.g. you have a feature for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;color&quot;&lt;/code&gt; that you’ve coded as integers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;5&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Whatever the case, you need to know exactly:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Where your data is (i.e. which column)&lt;/li&gt;
  &lt;li&gt;How your data is coded (e.g. float vs. integer vs. text)&lt;/li&gt;
  &lt;li&gt;What your data means (e.g. the integer entry &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;43&lt;/code&gt; in column 5 corresponds to the color &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blue&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last point is very important, because you might have integers whose numerical distance doesn’t correspond to anything meaningful (e.g. the distance between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;7&lt;/code&gt; means nothing if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;orange&quot;&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;7&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;magenta&quot;&lt;/code&gt;). On the other hand, you might have integers where the distance between them is very important (e.g. the score on a test &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;99&lt;/code&gt; is much better than a grade of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;59&lt;/code&gt;). The distance between test grades is meaningful, but the distance between colors is not.&lt;/p&gt;

&lt;p&gt;In what follows, you have to decide how to represent your values, and whether or not their distances matter.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;convert-csv-to-tfrecords&quot;&gt;Convert CSV to TFRecords&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TFRecords&lt;/code&gt; is the preferred file format for TensorFlow. These &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; files take up a lot of space on disk, but they can be easily sharded and processed across machines, and the entire TensorFlow pipeline is optimized with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; in mind.&lt;/p&gt;

&lt;p&gt;To work with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; data, you have to first format your CSV data using TensorFlow itself. We have to read in the CSV file one example at a time, format it as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.train.Example&lt;/code&gt;, and then write that example to a file on disk. Each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.train.Example&lt;/code&gt; stores information about that particular example via so-called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;features&lt;/code&gt;, where these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;features&lt;/code&gt; can be anything (including the target label!). You will store each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Example&lt;/code&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature&lt;/code&gt; as an item in a dictionary, where the key should be descriptive. You can see in the following example I have chosen the keys &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;label&quot;&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;feats&quot;&lt;/code&gt; to make sure I won’t mix them up.&lt;/p&gt;

&lt;p&gt;Below is an example Python script to read in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.csv&lt;/code&gt; data file and save to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.tfrecords&lt;/code&gt; file. You can find the original version of the following &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;csv-to-tfrecords.py&lt;/code&gt; &lt;a href=&quot;https://github.com/JRMeyer/kaldi-tf/blob/master/csv-to-tfrecords.py&quot;&gt;here&lt;/a&gt;. There are faster ways to do this (i.e. via parallelization), but I want to give you working code which is as readable as possible.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tensorflow&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;# 
# USAGE: $ python3 csv-to-tfrecords.py data.csv data.tfrecords
#
&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;infile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;outfile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;csv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pandas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;infile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;python_io&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TFRecordWriter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;writer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;## READ FROM CSV ##
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# row is read as a single char string of the label and all my floats, so remove trailing whitespace and split
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rstrip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos; &apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# the first col is label, all rest are feats
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;label&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# convert each floating point feature from char to float to bytes
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;feats&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;feat&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tostring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;## SAVE TO TFRECORDS ##
&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# A tfrecords file is made up of tf.train.Example objects, and each of these
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# tf.train.Examples contains one or more &quot;features&quot;
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# use SequenceExample if you&apos;ve got sequential data
&lt;/span&gt;        
        &lt;span class=&quot;n&quot;&gt;example&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;feats&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bytes_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;writer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SerializeToString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        &lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There are plenty of resources on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; out there; check out the official docs on &lt;a href=&quot;https://www.tensorflow.org/versions/r1.0/programmers_guide/reading_data#file_formats&quot;&gt;reading data&lt;/a&gt;, &lt;a href=&quot;https://www.tensorflow.org/versions/r1.0/api_guides/python/python_io#tfrecords_format_details&quot;&gt;Python-IO&lt;/a&gt;, and &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/datasets#consuming_tfrecord_data&quot;&gt;importing data&lt;/a&gt;.&lt;/p&gt;
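
&lt;p&gt;If you want to sanity-check the file you just wrote, here is a minimal sketch (assuming TensorFlow 1.x, matching the writer script above, and a file named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data.tfrecords&lt;/code&gt;) that reads the first serialized record back and prints its label:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import tensorflow as tf

# iterate over the serialized records and parse the first one back into a tf.train.Example
for serialized in tf.python_io.tf_record_iterator(&apos;data.tfrecords&apos;):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    print(example.features.feature[&apos;label&apos;].int64_list.value)
    break&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;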

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;pat-on-the-back&quot;&gt;Pat on the Back&lt;/h2&gt;

&lt;p&gt;If you’ve gotten to this point, you have successfully converted your data and saved it as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TFRecords&lt;/code&gt; format. Take a pause and pat yourself on the back, because you’ve accomplished the most time-consuming and boring part of machine learning: data formatting.&lt;/p&gt;

&lt;p&gt;Now that you have your data in a format TensorFlow likes, we can import that data and train some models. Before we jump straight into training code, you’ll want a little background on TensorFlow’s awesome APIs for working with data and models: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.data&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.estimator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;datasets-and-estimators&quot;&gt;Datasets and Estimators&lt;/h2&gt;

&lt;p&gt;The official TensorFlow docs push hard for you to use their &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/data/Dataset&quot;&gt;Dataset&lt;/a&gt; and &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator&quot;&gt;Estimator&lt;/a&gt; APIs. In general, if the docs explicitly tell you there is a preferred way to do something, you should do it that way, because the newest features are sure to work with the preferred approach but maybe not with the alternatives.&lt;/p&gt;

&lt;h3 id=&quot;dataset-api&quot;&gt;Dataset API&lt;/h3&gt;

&lt;h3 id=&quot;tfdatadataset&quot;&gt;&lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/data/Dataset&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.data.Dataset&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dataset&lt;/code&gt; Class allows you to easily import, shuffle, transform, and batch your data. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dataset&lt;/code&gt; API makes any pre-processing operation on your data just another part of the pipeline, and it’s optimized for large, distributed datasets. Your entire pre-processing pipeline can be as simple as this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;dataset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TFRecordDataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;/your/path/to/data/my-data.tfrecords&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shuffle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buffer_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In the above definition of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dataset&lt;/code&gt;, you can see there’s a line where you point TensorFlow to your data on disk and read it in via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.data.TFRecordDataset&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.shuffle()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.batch()&lt;/code&gt; functions are optional, but you will need the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.map()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.map()&lt;/code&gt; function is where your data gets parsed into meaningful pieces like “labels” and “features”. However, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.map()&lt;/code&gt; is a completely general function which knows nothing about your data, so we have to pass it a special parsing function which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.map()&lt;/code&gt; then applies to every record. This &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parser&lt;/code&gt; function is probably the main thing you have to write for your own dataset, and it should exactly mirror the way you saved your data to TFRecords above with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.train.Example&lt;/code&gt; object (in the &lt;a href=&quot;#get-data-in-csv&quot;&gt;data formatting section&lt;/a&gt; above). Read more about parser functions in &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/datasets#preprocessing_data_with_datasetmap&quot;&gt;the official docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The above is one of the simplest ways to load, shuffle, and batch your data, but it is not the fastest way. For tips on speeding this stage up, take a look &lt;a href=&quot;https://stackoverflow.com/questions/50927298/faster-k-means-clustering-in-tensorflow/51030160#51030160&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://www.tensorflow.org/performance/datasets_performance&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
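
&lt;p&gt;As one rough sketch of the kind of speed-up those links describe, you can parse records on multiple threads and prefetch batches so the input pipeline overlaps with training (the exact numbers here are just placeholders):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;dataset = (
    tf.data.TFRecordDataset(&apos;/your/path/to/data/my-data.tfrecords&apos;)
    .map(parser, num_parallel_calls=4)  # parse several records in parallel
    .shuffle(buffer_size=1024)
    .batch(32)
    .prefetch(1)  # prepare the next batch while the current one is training
)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;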

&lt;p&gt;Here’s an example of such a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parser&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;parser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;&apos;&apos;
    This is a parser function. It defines the template for
    interpreting the examples you&apos;re feeding in. Basically, 
    this function defines what the labels and data look like
    for your labeled data. 
    &apos;&apos;&apos;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# the &apos;features&apos; here include your normal data feats along
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# with the label for that data
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FixedLenFeature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&apos;label&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FixedLenFeature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse_single_example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# some conversion and casting to get from bytes to floats and ints
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;feats&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convert_to_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode_raw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;label&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# since you can have multiple kinds of feats, you return a dictionary for feats
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# but only an int for the label
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;feats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To get into the details of this function and how you can define one for your data, take a look at the &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/datasets#preprocessing_data_with_datasetmap&quot;&gt;official parse function docs&lt;/a&gt;. Remember that if you have labeled training data, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;features&lt;/code&gt; definition above includes the data features (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feats&lt;/code&gt;) as well as the labels (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;label&lt;/code&gt;). If you’re doing something like k-means clustering (where labels aren’t used), you won’t return a label.&lt;/p&gt;
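
&lt;p&gt;For the unlabeled case, a minimal sketch of such a parser might look like this (assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; were written with only a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;feats&quot;&lt;/code&gt; feature):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def parser_unlabeled(record):
    # same idea as above, but there is no label to parse or return
    features = {&apos;feats&apos;: tf.FixedLenFeature([], tf.string)}
    parsed = tf.parse_single_example(record, features)
    feats = tf.convert_to_tensor(tf.decode_raw(parsed[&apos;feats&apos;], tf.float64))
    return {&apos;feats&apos;: feats}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;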

&lt;h3 id=&quot;estimator-api&quot;&gt;Estimator API&lt;/h3&gt;

&lt;h3 id=&quot;tfestimatorestimator&quot;&gt;&lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.estimator.Estimator&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimator&lt;/code&gt; class gives you an API for interacting with your model. Here’s a good overview from &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/estimators&quot;&gt;the official docs&lt;/a&gt;. It’s a wrapper around a model which lets you train, evaluate, and export the model as well as make inferences on new data. Usually you won’t interact directly with the base class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.estimator.Estimator&lt;/code&gt;, but rather with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimator&lt;/code&gt; classes which directly inherit from it, such as the &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNClassifier&lt;/code&gt;&lt;/a&gt; class. There is a whole set of pre-defined, easy-to-use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimator&lt;/code&gt;s which you can start working with out of the box, such as &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LinearRegressor&lt;/code&gt;&lt;/a&gt; or &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/BoostedTreesClassifier&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BoostedTreesClassifier&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can instantiate an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimator&lt;/code&gt; object with minimal, readable code. If you decide to use the pre-existing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimator&lt;/code&gt;s from TensorFlow (i.e. “pre-canned” models), you can get started without digging any deeper than the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__init__()&lt;/code&gt; function! I’ve defined a 4-layer Deep Neural Network which takes my input data (377-dimensional feature vectors) and predicts one of my 96 classes, like so:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;DNNClassifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;estimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNClassifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;

   &lt;span class=&quot;c1&quot;&gt;# for a DNN, this feature_columns object is really just a definition
&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# of the input layer
&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;feature_columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feature_column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numeric_column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                                       &lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;377&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,),&lt;/span&gt;
                                                       &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt;

   &lt;span class=&quot;c1&quot;&gt;# four hidden layers with 256 nodes in each layer
&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;hidden_units&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
   
   &lt;span class=&quot;c1&quot;&gt;# number of classes (aka number of nodes on the output layer)
&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;n_classes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We’ve just defined a new DNN Classifier with an input layer (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt;), four hidden layers (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hidden_units&lt;/code&gt;), and an output layer (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n_classes&lt;/code&gt;). Pretty easy, yeah?&lt;/p&gt;

&lt;p&gt;You will probably agree that each of these three arguments is very clear except for maybe the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt; argument. You can think of “feature_columns” as being identical to “input_layer”. However, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt; allows you to do a whole lot of pre-processing that a traditional input layer would never allow. The &lt;a href=&quot;https://www.tensorflow.org/guide/feature_columns&quot;&gt;official documentation on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt;&lt;/a&gt; is really good, and you should take a look. In a nutshell, think of these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt; as a set of instructions for how to squeeze your raw data into the right shape for a neural net (or whatever model you’re training). Neural nets cannot take words, integers, or anything else that isn’t a floating point number as input.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt; API helps you not only get your data into floats, but it helps you find floats that actually make sense for your task at hand. You can easily encode words or categories as one-hot vectors, but one-hot vectors are not practical if you have a billion different words in your data. Instead of using &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list&quot;&gt;one-hot vector &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_columns&lt;/code&gt;&lt;/a&gt;, you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_column&lt;/code&gt; type &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;embedding_column&lt;/code&gt;&lt;/a&gt; to find a lower-dimensional representation of your data. In the example above, I use the &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature_column.numeric_column&lt;/code&gt;&lt;/a&gt; because my input data is already encoded as floating point numbers.&lt;/p&gt;
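
&lt;p&gt;As a quick, hypothetical illustration (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;word&apos;&lt;/code&gt; feature and tiny vocabulary below are made up), here is how a categorical feature could be fed in either as a one-hot style column or as a learned embedding:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# a hypothetical categorical feature with a small vocabulary
word_col = tf.feature_column.categorical_column_with_vocabulary_list(
    key=&apos;word&apos;, vocabulary_list=[&apos;cat&apos;, &apos;dog&apos;, &apos;bird&apos;])

# one-hot style representation of the category
word_onehot = tf.feature_column.indicator_column(word_col)

# learned, lower-dimensional representation of the category
word_embedded = tf.feature_column.embedding_column(word_col, dimension=8)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;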

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;putting-it-all-together&quot;&gt;Putting it All Together&lt;/h2&gt;

&lt;p&gt;Below is an example of minimal code you need for importing a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfrecords&lt;/code&gt; file, training a model, and making predictions on new data.&lt;/p&gt;

&lt;h3 id=&quot;parser_fn&quot;&gt;parser_fn&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parser&lt;/code&gt; function will be the most data-specific part of your code. Learn how to make a good parser function &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/datasets#preprocessing_data_with_datasetmap&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;parser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FixedLenFeature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;label&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FixedLenFeature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse_single_example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;feats&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convert_to_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode_raw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;label&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;label&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;feats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;input_fn&quot;&gt;input_fn&lt;/h3&gt;

&lt;p&gt;This is an Estimator input function. It defines things like datasets and batches, and can perform operations such as shuffling. Both the dataset and dataset iterator are defined here. Read more about how to make a good &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;input_fn&lt;/code&gt; on &lt;a href=&quot;https://www.tensorflow.org/versions/r1.3/get_started/input_fn&quot;&gt;the official docs&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;my_input_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tfrecords_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;dataset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TFRecordDataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tfrecords_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  
  &lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_one_shot_iterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;batch_feats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_feats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_labels&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;estimator&quot;&gt;Estimator&lt;/h3&gt;

&lt;p&gt;To get started fast, just choose an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimator&lt;/code&gt; from the &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator&quot;&gt;available pre-made Estimators&lt;/a&gt;. For more detail on how to use pre-made &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimators&lt;/code&gt; in general, check out &lt;a href=&quot;https://www.tensorflow.org/guide/premade_estimators&quot;&gt;the official docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want a custom architecture which is not pre-made, you can &lt;a href=&quot;https://www.tensorflow.org/guide/custom_estimators&quot;&gt;build your own Estimator&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;DNNClassifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;estimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNClassifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;feature_columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feature_column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numeric_column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;feats&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;377&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,))],&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;hidden_units&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;n_classes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;model_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/tmp/tf&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
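
&lt;p&gt;If you do go the custom route mentioned above, you write a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;model_fn&lt;/code&gt; and hand it to the base &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.estimator.Estimator&lt;/code&gt; class. Here is a minimal sketch (not the pre-made model above) which mirrors the 377-dimensional inputs and 96 classes, with a made-up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;model_dir&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def my_model_fn(features, labels, mode):
    # cast the raw float64 feats to float32 and fix the shape for the dense layers
    net = tf.reshape(tf.cast(features[&apos;feats&apos;], tf.float32), [-1, 377])
    net = tf.layers.dense(net, units=256, activation=tf.nn.relu)
    logits = tf.layers.dense(net, units=96)
    predicted_classes = tf.argmax(logits, axis=1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions={&apos;class_ids&apos;: predicted_classes})

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss)

    train_op = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

CustomClassifier = tf.estimator.Estimator(model_fn=my_model_fn, model_dir=&apos;/tmp/tf-custom&apos;)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;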

&lt;h3 id=&quot;train--eval-specs&quot;&gt;Train &amp;amp; Eval Specs&lt;/h3&gt;

&lt;p&gt;Defining the training and evaluation routine for your model is easy with &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/TrainSpec&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TrainSpec&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/EvalSpec&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EvalSpec&lt;/code&gt;&lt;/a&gt;. These two classes let you bundle your input data with instructions for how your model should be trained and evaluated.&lt;/p&gt;

&lt;p&gt;After you’ve defined the specs, you feed them to the specialized function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.estimator.train_and_evaluate&lt;/code&gt;, which nicely handles all the heavy lifting. The Google Cloud folks wrote &lt;a href=&quot;https://cloud.google.com/blog/big-data/2018/02/easy-distributed-training-with-tensorflow-using-tfestimatortrain-and-evaluate-on-cloud-ml-engine&quot;&gt;a very nice blog post&lt;/a&gt; on how to get the most out of this function and your specs.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;train_spec_dnn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;estimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TrainSpec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_fn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_input_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;/home/ubuntu/train.tfrecords&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;eval_spec_dnn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;estimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EvalSpec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_fn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_input_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;/home/ubuntu/eval.tfrecords&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;estimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train_and_evaluate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNClassifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;train_spec_dnn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eval_spec_dnn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;inference&quot;&gt;Inference&lt;/h3&gt;

&lt;p&gt;Finally, to make predictions on new data, just use the &lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#predict&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.predict()&lt;/code&gt;&lt;/a&gt; method which is available to all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Estimators&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNClassifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_fn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_input_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;/home/ubuntu/test.tfrecords&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
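
&lt;p&gt;Each element of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;predictions&lt;/code&gt; is a dictionary of outputs (for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNClassifier&lt;/code&gt; this includes things like class ids and probabilities), so pulling out the predicted class for each test example looks roughly like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# grab the single predicted class id from each prediction dictionary
predicted_classes = [p[&apos;class_ids&apos;][0] for p in predictions]
print(predicted_classes[:10])&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;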

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;I hope you’ve found this post helpful!&lt;/p&gt;

&lt;p&gt;Feel free to leave questions and comments below!&lt;/p&gt;

</description>
        <pubDate>Wed, 29 May 2019 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/machinelearning/2019/05/29/tensorflow-dataset-estimator-api.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/machinelearning/2019/05/29/tensorflow-dataset-estimator-api.html</guid>
        
        
        <category>MachineLearning</category>
        
      </item>
    
      <item>
        <title>Some Linux Text Processing Notes</title>
        <description>&lt;h2 id=&quot;cut&quot;&gt;&lt;a href=&quot;https://linux.die.net/man/1/cut&quot;&gt;cut&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;edit-single-column-of-csv-file&quot;&gt;Edit single column of csv file&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# the following takes the first two columns of a CSV file (FILE), and performs a sed cleaning on the second&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;paste&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &amp;lt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f1&lt;/span&gt; FILE &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &amp;lt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f2&lt;/span&gt; FILE | &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/from/to/g&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;OUTPUT&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;edit-column-of-single-file-skipping-header&quot;&gt;Edit column of single file, skipping header&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;paste&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &amp;lt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f1&lt;/span&gt;,2 &lt;span class=&quot;nv&quot;&gt;$INFILE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&amp;lt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &amp;lt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-1&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$INFILE&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f3-&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &amp;lt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;tail&lt;/span&gt; +2 &lt;span class=&quot;nv&quot;&gt;$INFILE&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f3-&lt;/span&gt; | ./replace-bytes.sh &lt;span class=&quot;nv&quot;&gt;$NUM_BYTES&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$OUTFILE&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;sed&quot;&gt;&lt;a href=&quot;https://linux.die.net/man/1/sed&quot;&gt;sed&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;lowercase&quot;&gt;Lowercase&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/\(.*\)/\L\1/g&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;remove-punctuation&quot;&gt;Remove punctuation&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/[[:punct:]]\+//g&apos;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;grep&quot;&gt;&lt;a href=&quot;https://linux.die.net/man/1/grep&quot;&gt;grep&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;get-unique-characters-in-file&quot;&gt;Get unique characters in file&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;.&quot;&lt;/span&gt; FILE | &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;uniq&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;awk&quot;&gt;&lt;a href=&quot;https://linux.die.net/man/1/awk&quot;&gt;awk&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;sort-lines-in-file-by-length-in-characters&quot;&gt;Sort lines in file by length in characters&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;FILE | &lt;span class=&quot;nb&quot;&gt;awk&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;{ print length, $0 }&apos;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f2-&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

</description>
        <pubDate>Sat, 02 Mar 2019 00:00:00 +0000</pubDate>
        <link>http://jrmeyer.github.io/misc/2019/03/02/Linux-textProc-Notes.html</link>
        <guid isPermaLink="true">http://jrmeyer.github.io/misc/2019/03/02/Linux-textProc-Notes.html</guid>
        
        
        <category>misc</category>
        
      </item>
    
  </channel>
</rss>
