Mirek Długosz personal website - tutorialhttps://mirekdlugosz.com/2024-03-12T17:57:46+01:00Don’t blindly serve WebP format2024-03-12T17:57:46+01:002024-03-12T17:57:46+01:00Mirek Długosztag:mirekdlugosz.com,2024-03-12:/blog/2024/dont-blindly-serve-webp-format/<p>If you have done any webdev work in the last few years, you must have heard about WebP. It’s an image format that promises up to 34% smaller file sizes without a noticeable drop in quality. It has been pretty much universally supported since late 2020.</p>
<p>With smaller file sizes and widespread support, you might think it’s a good idea to just serve all your images in WebP. Or, if you want to be extra backward-compatible - serve WebP to all browsers that claim to support it, and the original image to the remaining few.</p>
<p>I also thought it was a good idea, and made the switch on this very website. I create WebP files with the compression factor set to 80. And then I noticed that one file was actually larger after conversion.</p>
<p>Following this thread, I compared the size of all WebP images with their original counterparts. It turned out that WebP produced larger files in about 3% of cases (4 out of 117). In the worst case, a 20 kB <span class="caps">PNG</span> file turned into a 122 kB WebP - more than a sixfold increase in size!</p>
<p>Since then, whenever I generate a WebP file, I compare its size to the original and keep it only if the new format produces a smaller file. This way browsers will always receive the smallest file I can produce, regardless of the format.</p>
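<p>The size check can be sketched in a few lines of Python. This is only an illustration, not the script used on this website: the <code>keep_smaller</code> helper and its arguments are hypothetical, and it assumes the WebP file has already been generated next to the original by whatever converter you use.</p>

```python
from pathlib import Path


def keep_smaller(original: Path, converted: Path) -> Path:
    """Return the file that should be served.

    Hypothetical helper: deletes the converted file when it is not
    strictly smaller than the original.
    """
    if converted.stat().st_size < original.stat().st_size:
        return converted
    # The conversion did not pay off, so drop it and keep serving the original.
    converted.unlink()
    return original
```

<p>Calling <code>keep_smaller(Path("photo.png"), Path("photo.webp"))</code> after each conversion leaves only the files that are worth serving.</p>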
<p>I guess the main takeaway here is the age-old “measure before optimizing”.</p>Playwright - accessing page object in event handler2024-01-03T18:37:03+01:002024-01-03T18:37:03+01:00Mirek Długosztag:mirekdlugosz.com,2024-01-03:/blog/2024/playwright-accessing-page-object-in-event-handler/<p>Playwright <a href="https://playwright.dev/python/docs/api/class-page#events">exposes a number of browser events</a> and provides a mechanism to respond to them. Since many of these events signal errors and problems, most of the time you want to log them, halt program execution, or ignore them and move on. Logging is also shown in the <a href="https://playwright.dev/python/docs/network#network-events">Playwright documentation about network</a>, which I will use as a base for examples in this article.</p>
<h2 id="problem-statement"><a class="toclink" href="#problem-statement">Problem statement</a></h2>
<p>The documentation shows event handlers created with <code>lambda</code> expressions, but lambdas pose significant problems once you leave the territory of toy examples:</p>
<ul>
<li>they should fit in a single line of code</li>
<li>you can’t share them across modules</li>
<li>you can’t unit test them in isolation</li>
</ul>
<p>Usually you want to define event handlers as normal functions. But when you attempt that, you might run into another problem - Playwright invokes the event handler with some event-related data, but that data does not contain any reference back to the <code>page</code> object, and the <code>page</code> object might contain some important contextual information.</p>
<p>In other words, we would like to do something similar to code below. Note that this example does not work - if you run it, you will get <code>NameError: name 'page' is not defined</code>. </p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">sync_playwright</span>
<span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">Playwright</span>


<span class="k">def</span> <span class="nf">request_handler</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> issued request: </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">method</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">response_handler</span><span class="p">(</span><span class="n">response</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> received response: </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">status</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">:</span> <span class="n">Playwright</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">playwright</span><span class="o">.</span><span class="n">chromium</span><span class="o">.</span><span class="n">launch</span><span class="p">()</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="o">.</span><span class="n">new_page</span><span class="p">()</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://mirekdlugosz.com"</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"request"</span><span class="p">,</span> <span class="n">request_handler</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"response"</span><span class="p">,</span> <span class="n">response_handler</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://httpbin.org/status/404"</span><span class="p">)</span>
    <span class="n">browser</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>


<span class="k">with</span> <span class="n">sync_playwright</span><span class="p">()</span> <span class="k">as</span> <span class="n">playwright</span><span class="p">:</span>
    <span class="n">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">)</span>
</code></pre></div>
<p>I can think of three ways of solving that: by defining a function inside a function, with <code>functools.partial</code>, and with a factory function. Let’s take a look at all of them.</p>
<h2 id="defining-a-function-inside-a-function"><a class="toclink" href="#defining-a-function-inside-a-function">Defining a function inside a function</a></h2>
<p>Most Python users are so used to defining functions at the top level of a module or inside a class (we call these “methods”) that they might consider function definitions to be somewhat special. In fact, some other programming languages do restrict where functions can be defined. But in Python you can define them anywhere, including inside other functions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">sync_playwright</span>
<span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">Playwright</span>


<span class="k">def</span> <span class="nf">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">:</span> <span class="n">Playwright</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">request_handler</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> issued request: </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">method</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">response_handler</span><span class="p">(</span><span class="n">response</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> received response: </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">status</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>

    <span class="n">browser</span> <span class="o">=</span> <span class="n">playwright</span><span class="o">.</span><span class="n">chromium</span><span class="o">.</span><span class="n">launch</span><span class="p">()</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="o">.</span><span class="n">new_page</span><span class="p">()</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://mirekdlugosz.com"</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"request"</span><span class="p">,</span> <span class="n">request_handler</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"response"</span><span class="p">,</span> <span class="n">response_handler</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://httpbin.org/status/404"</span><span class="p">)</span>
    <span class="n">browser</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>


<span class="k">with</span> <span class="n">sync_playwright</span><span class="p">()</span> <span class="k">as</span> <span class="n">playwright</span><span class="p">:</span>
    <span class="n">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">)</span>
</code></pre></div>
<p>This works because <a href="https://docs.python.org/3/reference/compound_stmts.html#function-definitions">function body is not evaluated until function is called</a> and <a href="https://peps.python.org/pep-0227/">functions have access to names defined in their encompassing scope</a>. So Python will look up <code>page</code> only when event handler is invoked by Playwright; since it’s not defined in function itself, Python will look for it in the function where event handler was defined (and then next function, if there is one, then module and eventually builtins).</p>
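<p>The lookup rule is easier to see without Playwright in the way. The sketch below uses a plain string in place of a real <code>page</code> object; the names <code>handler</code> and <code>result</code> are made up for illustration.</p>

```python
def run_test():
    def handler(event):
        # "page" does not exist yet when this function is defined;
        # Python looks the name up only when handler() is actually called.
        return f"{page} received {event}"

    # Assigned after the handler was defined, but before it is called.
    page = "https://example.com/"
    return handler("request")


result = run_test()
```

<p>If <code>handler</code> were called before <code>page</code> is assigned, Python would raise a <code>NameError</code>, just like in the broken example from the problem statement.</p>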
<p>I think this solution solves the most important part of the problem - it allows you to write event handlers that span multiple lines. Technically it is also possible to share these handlers across modules, but you won’t see that often. They can’t be unit tested in isolation, as they depend on their parent function.</p>
<h2 id="functoolspartial"><a class="toclink" href="#functoolspartial"><code>functools.partial</code></a></h2>
<p><a href="https://docs.python.org/3/library/functools.html#functools.partial"><code>functools.partial</code> documentation</a> may be confusing, as the prose sounds exactly like a description of a standard function, the code equivalent assumes a pretty good understanding of Python internals, and the provided example seems completely unnecessary.</p>
<p>I think about <code>partial</code> this way: it creates a function that has some of the arguments already filled in.</p>
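<p>A minimal illustration of that mental model, using a made-up <code>greet</code> function rather than anything from Playwright:</p>

```python
from functools import partial


def greet(greeting, name):
    return f"{greeting}, {name}!"


# say_hello behaves like greet with the first argument already filled in.
say_hello = partial(greet, "Hello")

message = say_hello("Mirek")
```

<p>Here <code>say_hello("Mirek")</code> is the same call as <code>greet("Hello", "Mirek")</code>; the caller only supplies the argument that is still missing.</p>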
<p>To be fair, <code>partial</code> is rarely <em>needed</em>. It allows you to write shorter code, as you don’t have to repeat the same arguments over and over again. It may also allow you to provide a saner library <span class="caps">API</span> - you can define a single generic and flexible function with a lot of arguments, and a few helper functions intended for external use, each with a small number of arguments.</p>
<p>But it’s invaluable when you have to provide your own function, but you don’t have control over arguments it will receive. Which is <em>exactly</em> the problem we are facing.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>
<span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">sync_playwright</span>
<span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">Playwright</span>


<span class="k">def</span> <span class="nf">request_handler</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="n">page</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> issued request: </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">method</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">response_handler</span><span class="p">(</span><span class="n">response</span><span class="p">,</span> <span class="n">page</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> received response: </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">status</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">:</span> <span class="n">Playwright</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">playwright</span><span class="o">.</span><span class="n">chromium</span><span class="o">.</span><span class="n">launch</span><span class="p">()</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="o">.</span><span class="n">new_page</span><span class="p">()</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://mirekdlugosz.com"</span><span class="p">)</span>
    <span class="n">local_request_handler</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">request_handler</span><span class="p">,</span> <span class="n">page</span><span class="o">=</span><span class="n">page</span><span class="p">)</span>
    <span class="n">local_response_handler</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">response_handler</span><span class="p">,</span> <span class="n">page</span><span class="o">=</span><span class="n">page</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"request"</span><span class="p">,</span> <span class="n">local_request_handler</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"response"</span><span class="p">,</span> <span class="n">local_response_handler</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://httpbin.org/status/404"</span><span class="p">)</span>
    <span class="n">browser</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>


<span class="k">with</span> <span class="n">sync_playwright</span><span class="p">()</span> <span class="k">as</span> <span class="n">playwright</span><span class="p">:</span>
    <span class="n">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">)</span>
</code></pre></div>
<p>Notice that our function takes the same arguments as a Playwright event handler, and then some. When it’s time to assign event handlers, we use <code>partial</code> to create a new function, one that only needs the argument that Playwright will pass - the other one is already filled in. But when the function is executed, it will receive both arguments.</p>
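<p>Because the handler is now an ordinary module-level function, it can be exercised without launching a browser. The sketch below is one possible way to do that; the <code>fake_page</code> and <code>fake_request</code> objects are hypothetical stand-ins that only provide the attributes the handler reads, not real Playwright objects.</p>

```python
import io
from contextlib import redirect_stdout
from functools import partial
from types import SimpleNamespace


def request_handler(request, page=None):
    print(f"{page.url} issued request: {request.method} {request.url}")


# Stand-ins for Playwright's Page and Request (assumption: only these
# attributes are needed by the handler).
fake_page = SimpleNamespace(url="https://example.com/")
fake_request = SimpleNamespace(method="GET", url="https://example.com/style.css")

handler = partial(request_handler, page=fake_page)

# Capture what the handler prints so that it can be asserted on.
buffer = io.StringIO()
with redirect_stdout(buffer):
    handler(fake_request)

output = buffer.getvalue()
```

<p>In a real test suite the same idea fits into a regular test function, with the assertion made on <code>output</code>.</p>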
<h2 id="factory-function"><a class="toclink" href="#factory-function">Factory function</a></h2>
<p>Functions in Python may not only define other functions in their bodies, but also return functions. They are called “higher-order functions” and aren’t used often, with one notable exception of <a href="https://realpython.com/primer-on-python-decorators/">decorators</a>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">sync_playwright</span>
<span class="kn">from</span> <span class="nn">playwright.sync_api</span> <span class="kn">import</span> <span class="n">Playwright</span>


<span class="k">def</span> <span class="nf">request_handler_factory</span><span class="p">(</span><span class="n">page</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">inner</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> issued request: </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">method</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">request</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">inner</span>


<span class="k">def</span> <span class="nf">response_handler_factory</span><span class="p">(</span><span class="n">page</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">inner</span><span class="p">(</span><span class="n">response</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">page</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2"> received response: </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">status</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">response</span><span class="o">.</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">inner</span>


<span class="k">def</span> <span class="nf">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">:</span> <span class="n">Playwright</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">playwright</span><span class="o">.</span><span class="n">chromium</span><span class="o">.</span><span class="n">launch</span><span class="p">()</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="o">.</span><span class="n">new_page</span><span class="p">()</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://mirekdlugosz.com"</span><span class="p">)</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"request"</span><span class="p">,</span> <span class="n">request_handler_factory</span><span class="p">(</span><span class="n">page</span><span class="p">))</span>
    <span class="n">page</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s2">"response"</span><span class="p">,</span> <span class="n">response_handler_factory</span><span class="p">(</span><span class="n">page</span><span class="p">))</span>
    <span class="n">page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s2">"https://httpbin.org/status/404"</span><span class="p">)</span>
    <span class="n">browser</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>


<span class="k">with</span> <span class="n">sync_playwright</span><span class="p">()</span> <span class="k">as</span> <span class="n">playwright</span><span class="p">:</span>
    <span class="n">run_test</span><span class="p">(</span><span class="n">playwright</span><span class="p">)</span>
</code></pre></div>
<p>The key here is that inner function has access to all of enclosing scope, including values passed as arguments to outer function. This allows us to pass specific values that are only available in the place where outer function is called.</p>
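<p>A stripped-down sketch shows that each call to a factory produces a handler that remembers its own argument. The <code>make_handler</code> name and the string labels are made up for illustration; in the Playwright example the argument would be a <code>page</code> object.</p>

```python
def make_handler(label):
    def inner(event):
        # "label" comes from the enclosing call to make_handler().
        return f"{label}: {event}"
    return inner


first = make_handler("page-1")
second = make_handler("page-2")

results = [first("request"), second("request")]
```

<p>The two handlers share the same code but not the same <code>label</code>, which is what makes one factory usable for any number of pages.</p>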
<h2 id="summary"><a class="toclink" href="#summary">Summary</a></h2>
<p>The first solution is a little different from the other two, because it does not solve all of the problems set forth. On the other hand, I think it’s the easiest to understand - even beginner Python programmers should intuitively grasp what is happening and why.</p>
<p>In my experience, higher-order functions take some getting used to, while <code>partial</code> is not well-known and may be confusing at first. But they do solve our problem completely.</p>iOS: Opening Google Drive links directly in an app2023-09-20T23:48:24+02:002023-09-20T23:48:24+02:00Mirek Długosztag:mirekdlugosz.com,2023-09-20:/blog/2023/ios-opening-google-drive-links-directly-in-an-app/<p>In <a href="https://associationforsoftwaretesting.org/"><span class="caps">AST</span></a>, we use Slack for internal communication and Google Drive for file sharing.
Since we are all volunteers, many people are active only in the evenings.
And since many of them are in the <span class="caps">US</span>, their evening might be well into the night for me.
So it happens that someone sends a link to a Google Doc on Slack, I want to check it out, but I’m not in the mood to turn on my computer.</p>
<p>I use an iPhone.
I have all the Google apps installed and the <span class="caps">AST</span> account logged in.
However, Slack can only open links in Safari or open the Share Sheet, and Google apps did not register themselves in the Share Sheet for whatever reason.
So there’s no easy way to open a link directly in an app.</p>
<p>Luckily, there is a solution.</p>
<p>As <a href="https://apple.stackexchange.com/a/227149">an Apple StackExchange user points out, Google apps register secret protocols that can be used to open the current page in a specific app</a>.
I guess these are used by the “Open this page in $<span class="caps">APP</span>” bars that appear at the top of a page in a browser?
Either way, if you have a link to a Google doc, something like <code>https://docs.google.com/document/d/somerandomhashhere/edit</code>, you can prepend the <code>Googledocs://</code> protocol and Safari will open the Google Docs app automatically.
In other words, <code>https://docs.google.com/document/d/somerandomhashhere/edit</code> opens Safari, but <code>Googledocs://https://docs.google.com/document/d/somerandomhashhere/edit</code> opens the same document in the Google Docs app.</p>
<p>Of course copying a link, opening Safari, typing secret protocol and pasting a link gets boring very fast.
This is where Apple Shortcuts come in.</p>
<p>If you didn’t know, Shortcuts is a built-in application that allows you to write scripts in something resembling <a href="https://scratch.mit.edu/">Scratch</a>.
The interface is rather clunky and debugging is a <span class="caps">PITA</span>, but Shortcuts has two things that make it stand out: it comes with built-in interfaces to most parts of the system, and it allows you to run custom programs on an iOS device without thinking about the Apple development ecosystem.</p>
<p>Here’s a recipe for the shortcut I have created.
I won’t explain precisely how to create a new one and where to click to modify it, as such instructions are at risk of getting outdated really quickly.
See also a screenshot of how the full shortcut looks on my phone, below.</p>
<ol>
<li>Add “Receive input from” action. In first placeholder, I unchecked everything but “Safari web pages” and “URLs”. In second placeholder, I checked “Show in Share Sheet”. I also selected that if there’s no input, it should “Get Clipboard”.</li>
<li>Add “Get text from” action. I have “Shortcut Input” selected here. I’m actually not 100% sure if this action is even needed. Maybe it just passes data verbatim to the next step.</li>
<li>Add “Combine with” action. In the first placeholder, I have “Text”, which is the text from the previous step. In the second placeholder, I selected “Custom” and then typed “Googledocs://”. I <em>feel</em> I should be able to just concatenate two text streams, but I haven’t found a more straightforward way of doing this.</li>
<li>Add “Open URLs” action. In its only placeholder, I selected “Combined Text”, which is the result of the previous action.</li>
</ol>
<figure>
<a href="https://mirekdlugosz.com/blog/2023/ios-opening-google-drive-links-directly-in-an-app/ios-opening-google-drive-links-directly-in-an-app/shortcut-recipe.png">
<img src="https://mirekdlugosz.com/blog/2023/ios-opening-google-drive-links-directly-in-an-app/ios-opening-google-drive-links-directly-in-an-app/shortcut-recipe-min.png" title="Shortcut recipe for opening link directly in Google app" alt="Shortcut recipe for opening link directly in Google app" loading="lazy">
</a>
</figure>
<p>When you click the name at the top, you can change the shortcut name and icon.
There are no Google app icons and only a few predefined colors are available, but you can make up something that looks close enough to Google branding.</p>
<p>As far as I can tell, you can send a Google Slides link to the Google Docs app and it will still do the right thing.
So you probably don’t need more than a single generic “Open in Google app” shortcut.
But if it ever stops working like that, or you like selecting apps explicitly, you may duplicate your new shortcut and create new ones for Google Sheets, Google Slides and Google Drive (for links to directories).
For reference, here’s a list of protocols that they use:</p>
<ul>
<li><code>Googledocs://</code> - Google Docs (text documents)</li>
<li><code>Googlesheets://</code> - Google Sheets (spreadsheets)</li>
<li><code>Googleslides://</code> - Google Slides (presentations)</li>
<li><code>Googledrive://</code> - Google Drive (folders, directories)</li>
</ul>
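<p>Outside of Shortcuts, the same transformation can be sketched in Python. This is only an illustration of the idea: the <code>app_link</code> function is hypothetical, and the mapping from the first path segment of a <code>docs.google.com</code> link to an app is my assumption about how these links are usually shaped, not something the apps document.</p>

```python
from urllib.parse import urlparse

# Assumption: the first path segment on docs.google.com identifies the editor.
SCHEME_BY_PATH = {
    "document": "Googledocs://",
    "spreadsheets": "Googlesheets://",
    "presentation": "Googleslides://",
}


def app_link(url: str) -> str:
    """Prepend the protocol of the Google app that should open the link."""
    parsed = urlparse(url)
    if parsed.netloc == "drive.google.com":
        scheme = "Googledrive://"
    else:
        segments = [part for part in parsed.path.split("/") if part]
        first = segments[0] if segments else ""
        # Fall back to the generic Google Docs protocol for anything unknown.
        scheme = SCHEME_BY_PATH.get(first, "Googledocs://")
    return scheme + url
```

<p>For example, <code>app_link("https://docs.google.com/document/d/somerandomhashhere/edit")</code> returns the <code>Googledocs://</code> link shown earlier.</p>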
<p>Now whenever someone sends a Google Drive link on Slack, I can long-press it, select “Share…” and pick one of my shortcuts from near the bottom.
This automatically opens the document in the specified Google app.</p>Asking for ssh key passphrase when signing git commit2023-01-05T18:19:01+01:002023-01-05T18:19:01+01:00Mirek Długosztag:mirekdlugosz.com,2023-01-05:/blog/2023/asking-for-ssh-key-passphrase-when-signing-git-commit/<p>git has an option to sign commits and tags.
This allows you to verify that a change indeed comes from the person it claims to come from.
Since 2.34.0, ssh can be used to sign things.
Which is nice, because everyone already has ssh configured to authorize pushes, so you can re-use the same key for authenticity certification.</p>
<p>I started signing all my commits in November 2022, using <a href="https://blog.dbrgn.ch/2021/11/16/git-ssh-signatures/">Danilo Bargen’s blog post</a> as a guide.
Instead of hard-coding my public ssh key in config file, I told git to get it from <code>ssh-add -L</code>.</p>
<p>This setup works well overall, but has one problem - <code>git commit</code> will fail if I forget to load the private key into the ssh agent first. It’s easy enough to recover from this without losing the commit message, but wouldn’t it be nice if git asked for the ssh key passphrase automatically?</p>
<p>Turns out it’s very simple to do!
First, create a helper script and save it somewhere.
Here’s mine:</p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/usr/bin/env bash</span>
<span class="nv">SSH_KEY</span><span class="o">=</span><span class="k">$(</span>ssh-add<span class="w"> </span>-L<span class="k">)</span>
<span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span><span class="s2">"</span><span class="nv">$?</span><span class="s2">"</span><span class="w"> </span>-eq<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">]</span><span class="p">;</span><span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$SSH_KEY</span><span class="s2">"</span>
<span class="k">else</span>
<span class="w"> </span>ssh-add
<span class="w"> </span>ssh-add<span class="w"> </span>-L
<span class="k">fi</span>
</code></pre></div>
<p>Then, change <code>defaultKeyCommand</code> in the global git config file to point to the helper script:</p>
<div class="highlight"><pre><span></span><code><span class="k">[gpg "ssh"]</span>
<span class="w"> </span><span class="na">defaultKeyCommand</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">/path/to/helper/script</span>
</code></pre></div>
<p>Now <code>git commit</code> will ask for ssh key passphrase if no key has been loaded yet.</p>Improving Pelican website development loop2021-02-09T11:38:27+01:002021-02-09T11:38:27+01:00Mirek Długosztag:mirekdlugosz.com,2021-02-09:/blog/2021/improving-pelican-website-development-loop/<p>I spend more time tinkering with my website than writing actual content. Here’s how I streamlined feedback loop on my Pelican-based website.</p>
<p>For years, my main development flow looked something like this:</p>
<ol>
<li>Change file</li>
<li>Possibly build static assets</li>
<li>Build site itself</li>
<li>Serve the site using some http server</li>
<li>Open the page in web browser, or refresh it if it’s already opened</li>
<li>Check how the site looks, find possible problems, mistakes, typos etc.</li>
<li>Shut down http server</li>
<li>Go back to step 1</li>
</ol>
<p>Since I started using it, Pelican has gained some convenience features that removed a few steps from my loop. Possibly the most significant was version 4.0.0, which added the <code>--listen</code> flag. Along with <code>--autoreload</code>, it relieves me of having to manually rebuild the site on every source file change, start the http server, and shut it down.</p>
<p>Unfortunately, that does not apply to all the files I could be changing. I am using gulp to compile <span class="caps">SCSS</span> files into <span class="caps">CSS</span>, bundle styles and scripts into single files, and minify them. While Pelican is able to pick up changes in theme files, it is not aware of theme source files or how to build them. So I still need to call <code>npx gulp</code> after each change when working on styling or scripts.</p>
<p>Another thing I missed a little is automatic refreshing of the page in the browser when the website changes. Hugo and Vue.js development environments have that and, as far as I can tell, it completely solves the problem of the browser using an outdated, cached version of a static resource. With Pelican, there were times when I had to manually navigate to the stylesheet file and refresh it, since the browser did not pick up the changes I had made.</p>
<p>The other day, I learned that Pelican 4.1.0 introduced a livereload server, which does exactly that. This is what the default invoke task looks like:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">livereload</span><span class="p">(</span><span class="n">c</span><span class="p">):</span>
    <span class="kn">from</span> <span class="nn">livereload</span> <span class="kn">import</span> <span class="n">Server</span>
    <span class="n">build</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
    <span class="n">server</span> <span class="o">=</span> <span class="n">Server</span><span class="p">()</span>
    <span class="c1"># Watch the base settings file</span>
    <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'settings_base'</span><span class="p">],</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">build</span><span class="p">(</span><span class="n">c</span><span class="p">))</span>
    <span class="c1"># Watch content source files</span>
    <span class="n">content_file_extensions</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'.md'</span><span class="p">,</span> <span class="s1">'.rst'</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">extension</span> <span class="ow">in</span> <span class="n">content_file_extensions</span><span class="p">:</span>
        <span class="n">content_blob</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{0}</span><span class="s1">/**/*</span><span class="si">{1}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s1">'PATH'</span><span class="p">],</span> <span class="n">extension</span><span class="p">)</span>
        <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="n">content_blob</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">build</span><span class="p">(</span><span class="n">c</span><span class="p">))</span>
    <span class="c1"># Watch the theme's templates and static assets</span>
    <span class="n">theme_path</span> <span class="o">=</span> <span class="n">SETTINGS</span><span class="p">[</span><span class="s1">'THEME'</span><span class="p">]</span>
    <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="s1">'</span><span class="si">{}</span><span class="s1">/templates/*.html'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">theme_path</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">build</span><span class="p">(</span><span class="n">c</span><span class="p">))</span>
    <span class="n">static_file_extensions</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'.css'</span><span class="p">,</span> <span class="s1">'.js'</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">extension</span> <span class="ow">in</span> <span class="n">static_file_extensions</span><span class="p">:</span>
        <span class="n">static_file</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{0}</span><span class="s1">/static/**/*</span><span class="si">{1}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">theme_path</span><span class="p">,</span> <span class="n">extension</span><span class="p">)</span>
        <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="n">static_file</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">build</span><span class="p">(</span><span class="n">c</span><span class="p">))</span>
    <span class="c1"># Serve output path on configured host and port</span>
    <span class="n">server</span><span class="o">.</span><span class="n">serve</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'host'</span><span class="p">],</span> <span class="n">port</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'port'</span><span class="p">],</span> <span class="n">root</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'deploy_path'</span><span class="p">])</span>
</code></pre></div>
<p>Apart from finding it a little hard to read, I have reservations about some of the design choices. This version assumes that the theme’s Jinja files all sit in a single directory, without any subdirectories - while I have some recurring sections extracted into partials. <span class="caps">JS</span> and <span class="caps">CSS</span> files are certainly changed most often, but sometimes I also work on images - and I would like Pelican to pick up their changes, too. Finally, I don’t like how the watcher’s callback function is specified four different times - if I wanted to change the callback, I would have to do it in multiple places.</p>
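<p>The difference between the two kinds of pattern can be illustrated with a minimal sketch that uses Python’s standard <code>glob</code> module directly. The theme layout and the <code>matching_files</code> helper are made up for illustration, and the sketch only shows the pattern semantics; how livereload expands patterns internally may depend on its version.</p>

```python
import glob
import tempfile
from pathlib import Path


def matching_files(root, pattern):
    """Return paths (relative to root) of the files matched by a glob pattern."""
    matches = glob.glob(str(Path(root) / pattern), recursive=True)
    return sorted(
        Path(m).relative_to(root).as_posix() for m in matches if Path(m).is_file()
    )


def build_sample_theme(root):
    """Create a tiny, hypothetical theme with one partial in a subdirectory."""
    for name in ("templates/base.html", "templates/partials/footer.html"):
        path = Path(root) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")


with tempfile.TemporaryDirectory() as theme:
    build_sample_theme(theme)
    # Single-level pattern: only files directly inside templates/
    flat = matching_files(theme, "templates/*.html")
    # Recursive pattern: files at any depth below templates/
    recursive = matching_files(theme, "templates/**/*")

print(flat)
print(recursive)
```

<p>With this layout, the single-level pattern finds only <code>base.html</code>, while the recursive one also finds the partial in the subdirectory.</p>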
<p>After fixing these issues, I ended up with:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">devserver</span><span class="p">(</span><span class="n">c</span><span class="p">):</span>
    <span class="kn">from</span> <span class="nn">livereload</span> <span class="kn">import</span> <span class="n">Server</span>
    <span class="n">server</span> <span class="o">=</span> <span class="n">Server</span><span class="p">()</span>
    <span class="n">watched_globs</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">CONFIG</span><span class="p">[</span><span class="s1">'settings_base'</span><span class="p">],</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"PATH"</span><span class="p">]</span><span class="si">}</span><span class="s1">/**/*.md'</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">]</span><span class="si">}</span><span class="s1">/templates/**/*'</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">]</span><span class="si">}</span><span class="s1">/static/**/*'</span><span class="p">,</span>
    <span class="p">]</span>
    <span class="k">for</span> <span class="n">glob</span> <span class="ow">in</span> <span class="n">watched_globs</span><span class="p">:</span>
        <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="n">glob</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">html</span><span class="p">(</span><span class="n">c</span><span class="p">))</span>
    <span class="n">html</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
    <span class="n">server</span><span class="o">.</span><span class="n">serve</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'host'</span><span class="p">],</span> <span class="n">port</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'port'</span><span class="p">],</span> <span class="n">root</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'deploy_path'</span><span class="p">])</span>
</code></pre></div>
<p>(Yes, I renamed <code>livereload</code> to <code>devserver</code> and <code>build</code> to <code>html</code>.)</p>
<p>Once I ran that task, I discovered it takes about 1 second for Pelican to rebuild site on any file change. That’s a little too much for my taste - I found myself waiting for refresh to happen, wondering if it wouldn’t be faster if I just hit refresh myself. I do want convenience, but not at the cost of that much speed.</p>
<p>Luckily, Pelican has a built-in caching mechanism, and since version 4.5.0 it allows overriding specific settings from the command line. I needed to change the main build task slightly, and I extracted the watcher callback into a separate function for better readability:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">html</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">extra_settings</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="n">cmd</span> <span class="o">=</span> <span class="s1">'-s </span><span class="si">{settings_base}</span><span class="s1">'</span>
    <span class="k">if</span> <span class="n">extra_settings</span><span class="p">:</span>
        <span class="n">cmd</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">cmd</span><span class="si">}</span><span class="s1"> -e </span><span class="si">{</span><span class="n">extra_settings</span><span class="si">}</span><span class="s1">'</span>
    <span class="n">pelican_run</span><span class="p">(</span><span class="n">cmd</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="o">**</span><span class="n">CONFIG</span><span class="p">))</span>


<span class="k">def</span> <span class="nf">devserver</span><span class="p">(</span><span class="n">c</span><span class="p">):</span>
    <span class="kn">from</span> <span class="nn">livereload</span> <span class="kn">import</span> <span class="n">Server</span>
    <span class="k">def</span> <span class="nf">cached_html</span><span class="p">():</span>
        <span class="n">html</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">extra_settings</span><span class="o">=</span><span class="s1">'CACHE_CONTENT=True LOAD_CONTENT_CACHE=True'</span><span class="p">)</span>
    <span class="n">server</span> <span class="o">=</span> <span class="n">Server</span><span class="p">()</span>
    <span class="n">watched_globs</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">CONFIG</span><span class="p">[</span><span class="s1">'settings_base'</span><span class="p">],</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"PATH"</span><span class="p">]</span><span class="si">}</span><span class="s1">/**/*.md'</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">]</span><span class="si">}</span><span class="s1">/templates/**/*'</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">]</span><span class="si">}</span><span class="s1">/static/**/*'</span><span class="p">,</span>
    <span class="p">]</span>
    <span class="k">for</span> <span class="n">glob</span> <span class="ow">in</span> <span class="n">watched_globs</span><span class="p">:</span>
        <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="n">glob</span><span class="p">,</span> <span class="n">cached_html</span><span class="p">)</span>
    <span class="n">cached_html</span><span class="p">()</span>
    <span class="n">server</span><span class="o">.</span><span class="n">serve</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'host'</span><span class="p">],</span> <span class="n">port</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'port'</span><span class="p">],</span> <span class="n">root</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'deploy_path'</span><span class="p">])</span>
</code></pre></div>
<p>Now each file change results in a full page rebuild after only 600 ms, which is a reduction of around 40%. The same speedup ratio holds on another, more powerful machine, where caching brought rebuild time down from 500 ms to 300 ms.</p>
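<p>The post does not say how these times were measured. One possible way, shown here as a minimal sketch, is to wrap the watcher callback in a timing decorator. The names <code>timed</code>, <code>reduction_percent</code> and <code>fake_rebuild</code> are hypothetical; <code>fake_rebuild</code> merely stands in for the real rebuild callback.</p>

```python
import time


def timed(callback):
    """Wrap callback so that every call reports its duration in milliseconds."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return callback(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{callback.__name__} took {elapsed_ms:.0f} ms")
    return wrapper


def reduction_percent(before_ms, after_ms):
    """Relative reduction in duration, as a percentage of the original time."""
    return (before_ms - after_ms) * 100 / before_ms


def fake_rebuild():
    """Hypothetical stand-in for the real site rebuild."""
    time.sleep(0.01)
    return "done"


result = timed(fake_rebuild)()
# The two pairs of timings quoted in the text give the same ratio
print(reduction_percent(1000, 600))
print(reduction_percent(500, 300))
```

<p>A callback wrapped this way could be passed to <code>server.watch</code> in place of the bare function, so that every rebuild prints its duration.</p>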
<p>It saddens me a little that this is still a few times slower than Hugo, which is capable of partial rebuilds in server mode. It seems that Pelican can do partial rebuilds, too, through the <code>--write-selected</code> flag - however, it requires the path to the output file instead of the source. Moreover, there doesn’t seem to be an easy way for the livereload watcher to tell the callback function which file exactly changed.</p>
<p>While working on this, I discovered that I already had a gulp task for watching file changes and rebuilding static assets. It works in a similar fashion, as a foreground process that blocks the terminal and finishes on <code>Ctrl + C</code>. I decided to let my main task start it in the background and redirect its output to the main terminal window. As a result, messages from the two servers are interleaved, but at least I can see what is happening.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">devserver</span><span class="p">(</span><span class="n">c</span><span class="p">):</span>
    <span class="kn">from</span> <span class="nn">livereload</span> <span class="kn">import</span> <span class="n">Server</span>
    <span class="k">def</span> <span class="nf">cached_html</span><span class="p">():</span>
        <span class="n">html</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">extra_settings</span><span class="o">=</span><span class="s1">'CACHE_CONTENT=True LOAD_CONTENT_CACHE=True'</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">start_npm_devserver</span><span class="p">():</span>
        <span class="n">cmd</span> <span class="o">=</span> <span class="s2">"npm run devserver"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
        <span class="n">proc</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">Popen</span><span class="p">(</span>
            <span class="n">cmd</span><span class="p">,</span>
            <span class="n">stdout</span><span class="o">=</span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="p">,</span>
            <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">STDOUT</span><span class="p">,</span>
            <span class="n">cwd</span><span class="o">=</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">],</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">proc</span>
    <span class="n">npm_devserver</span> <span class="o">=</span> <span class="n">start_npm_devserver</span><span class="p">()</span>
    <span class="n">server</span> <span class="o">=</span> <span class="n">Server</span><span class="p">()</span>
    <span class="n">watched_globs</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">CONFIG</span><span class="p">[</span><span class="s1">'settings_base'</span><span class="p">],</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"PATH"</span><span class="p">]</span><span class="si">}</span><span class="s1">/**/*.md'</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">]</span><span class="si">}</span><span class="s1">/templates/**/*'</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">SETTINGS</span><span class="p">[</span><span class="s2">"THEME"</span><span class="p">]</span><span class="si">}</span><span class="s1">/static/**/*'</span><span class="p">,</span>
    <span class="p">]</span>
    <span class="k">for</span> <span class="n">glob</span> <span class="ow">in</span> <span class="n">watched_globs</span><span class="p">:</span>
        <span class="n">server</span><span class="o">.</span><span class="n">watch</span><span class="p">(</span><span class="n">glob</span><span class="p">,</span> <span class="n">cached_html</span><span class="p">)</span>
    <span class="n">cached_html</span><span class="p">()</span>
    <span class="n">server</span><span class="o">.</span><span class="n">serve</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'host'</span><span class="p">],</span> <span class="n">port</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'port'</span><span class="p">],</span> <span class="n">root</span><span class="o">=</span><span class="n">CONFIG</span><span class="p">[</span><span class="s1">'deploy_path'</span><span class="p">])</span>
    <span class="n">npm_devserver</span><span class="o">.</span><span class="n">terminate</span><span class="p">()</span>
</code></pre></div>
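<p>In the task above, the background process is terminated only when <code>server.serve</code> returns normally. As a possible variation, not something the task above does, the cleanup can be made unconditional with a small context manager. This is a minimal sketch: <code>background_process</code> is a hypothetical helper, and a sleeping Python child process stands in for <code>npm run devserver</code>.</p>

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def background_process(cmd, **popen_kwargs):
    """Start cmd in the background and stop it when the with-block exits."""
    proc = subprocess.Popen(cmd, **popen_kwargs)
    try:
        yield proc
    finally:
        # Runs on normal exit and on exceptions alike
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # The child ignored the polite request, so force it to stop
            proc.kill()
            proc.wait()


# Stand-in for "npm run devserver": a child process that just sleeps
sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]

with background_process(sleeper) as child:
    running_inside = child.poll() is None  # the child is alive inside the block

stopped_after = child.poll() is not None  # the child has exited after the block
print(running_inside, stopped_after)
```

<p>In the real task, the call to <code>server.serve</code> would go inside the <code>with</code> block.</p>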
<p>This final version of <code>devserver</code> accomplishes everything I wanted: changing any file results in an automatic site rebuild, changing the sources of a static asset rebuilds it, and the page in the browser is refreshed automatically. My main loop now looks something like this:</p>
<ol>
<li>Start development server (<code>inv devserver</code>)</li>
<li>Change file</li>
<li>Check how the site looks, find possible problems, mistakes, typos etc.</li>
<li>Go back to step 2</li>
</ol>Verify changes on your website before publishing them2019-12-05T11:44:26+01:002019-12-05T11:44:26+01:00Mirek Długosztag:mirekdlugosz.com,2019-12-05:/blog/2019/verify-changes-on-your-website-before-publishing-them/<p>Nice trick that lets you review changes done to website before publishing it. </p>
<p>I like static website generators. They are fast, secure, easy to tinker with and can be fully stored in version control system. </p>
<p>One neat thing they allow is seeing changes before publishing them. The idea is to get two versions of the website – before the change and after the change – and compare them automatically. Then it’s only a matter of deciding whether the differences you see are the ones you expected.</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>~/path/to/website
make<span class="w"> </span>clean<span class="w"> </span><span class="o">&&</span><span class="w"> </span>make<span class="w"> </span>publish<span class="w"> </span><span class="c1"># build the website</span>
rsync<span class="w"> </span>-av<span class="w"> </span><span class="nv">$HOSTING</span>:domains/mirekdlugosz.com/public_html/<span class="w"> </span>/tmp/live-website<span class="w"> </span><span class="c1"># store live version in /tmp/</span>
git<span class="w"> </span>diff<span class="w"> </span>--no-index<span class="w"> </span>/tmp/live-website/<span class="w"> </span>~/path/to/website/output/<span class="w"> </span><span class="c1"># compare new and old</span>
</code></pre></div>
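<p>The last step, comparing two directory trees, could also be sketched in Python with the standard <code>filecmp</code> module, for example to get a short summary before reading the full diff. The helper name <code>compare_trees</code> and the sample files are made up for illustration.</p>

```python
import filecmp
import tempfile
from pathlib import Path


def list_files(root):
    """Return the set of file paths below root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_trees(old_dir, new_dir):
    """Summarize which files were removed, added or changed between two trees."""
    old_files, new_files = list_files(old_dir), list_files(new_dir)
    common = sorted(old_files & new_files)
    # shallow=False compares file contents instead of trusting size and mtime
    _, mismatch, errors = filecmp.cmpfiles(old_dir, new_dir, common, shallow=False)
    return {
        "removed": sorted(old_files - new_files),
        "added": sorted(new_files - old_files),
        "changed": sorted(mismatch + errors),
    }


with tempfile.TemporaryDirectory() as live, tempfile.TemporaryDirectory() as built:
    # Hypothetical content: one changed page, one unchanged page, one new file
    (Path(live) / "index.html").write_text("old")
    (Path(built) / "index.html").write_text("new")
    (Path(live) / "about.html").write_text("same")
    (Path(built) / "about.html").write_text("same")
    (Path(built) / "feed.xml").write_text("feed")
    report = compare_trees(live, built)

print(report)
```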
<p>I usually run the shell commands above when something in my build pipeline changes and I want to ensure it is backwards compatible. That’s why I first copy the existing website back to disk instead of rebuilding it – so I can be sure my reference version is bit-by-bit identical to the live version.</p>Simple visual regression checking with Selenium and ImageMagick2019-11-24T20:48:57+01:002019-11-24T20:48:57+01:00Mirek Długosztag:mirekdlugosz.com,2019-11-24:/blog/2019/simple-visual-regression-checking-with-selenium-and-imagemagick/<p>I wanted to ensure that recent change did not break backwards compatibility and I ended up with visual regression checking script built with freely available software.</p>
<p>Recently, I switched object ids used by <a href="https://createpokemon.team/">createpokemon.team</a>. One of the steps in the entire process was creating a backwards compatibility layer - these ids are exposed in the <span class="caps">URL</span>, and there might be bookmarks and links posted around which could suddenly stop loading some data. In my quest to gain confidence that this solution works, I created a simple visual regression checking tool.</p>
<h2 id="talk-is-cheap-show-me-the-code"><a class="toclink" href="#talk-is-cheap-show-me-the-code">Talk is cheap, show me the code!</a></h2>
<p><a href="https://github.com/mirekdlugosz/scrapbook/tree/master/create-pokemon-team-visual-diff">The completed solution is hosted on GitHub</a>. This post is interspersed with code samples, but they are not intended to fully work on their own.</p>
<h2 id="testing-goals-and-strategy"><a class="toclink" href="#testing-goals-and-strategy">Testing goals and strategy</a></h2>
<p>The overarching goal of this activity was the rather vague “demonstrating that existing URLs continue to work”.</p>
<p>There are two main sources of “existing URLs”. One is the version deployed to production. I can fill in the form, copy part of the <span class="caps">URL</span> and test the new version against it. Since I know how the backwards compatibility procedure works, I can come up with data that might be problematic, as well as reference data that should not be problematic.</p>
<p>The other source is real URLs that real users navigated to out in the wild. Thankfully, I added Google Analytics to the website, and it provides a comprehensive list of all URLs - along with the number of visits for each. With that data, I can prioritize checking the Pokemon, moves and teams that are most popular.</p>
<p><span class="dquo">“</span>Continue to work” means two things: that the form is populated with the team data provided in the <span class="caps">URL</span>, and that the analysis outcome is unchanged.</p>
<p>Since these are questions about data, it’s only natural to think about it in isolation from presentation. That reasoning would set us on a path that includes gathering data from the website – and since there is no machine-readable output available, that means scraping. But we can exploit the fact that there were no changes in the <span class="caps">UI</span>, so the same output will be presented in the same way. If there is no visible difference between the old and new versions, then the data in both is sure to be the same. We don’t need to know what the data actually is.</p>
<h2 id="capturing-screenshot-with-selenium"><a class="toclink" href="#capturing-screenshot-with-selenium">Capturing screenshot with Selenium</a></h2>
<p>In the first iteration of my work, I focused on capturing a screenshot automatically. To do that, I need to open a web browser, navigate to the required page, ensure that all client-side operations have completed, capture an image of the visible site content and save it to disk. This can be done in just a couple of lines of code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">selenium</span> <span class="kn">import</span> <span class="n">webdriver</span>
<span class="n">teams</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># loading URLs is skipped for brevity</span>
<span class="n">team</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">teams</span><span class="p">)</span>
<span class="n">chrome_options</span> <span class="o">=</span> <span class="n">webdriver</span><span class="o">.</span><span class="n">ChromeOptions</span><span class="p">()</span>
<span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="o">.</span><span class="n">Chrome</span><span class="p">(</span><span class="n">options</span><span class="o">=</span><span class="n">chrome_options</span><span class="p">)</span>
<span class="n">base_url</span> <span class="o">=</span> <span class="s1">'http://localhost:4200'</span>
<span class="n">driver</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">base_url</span><span class="si">}{</span><span class="n">team</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">driver</span><span class="o">.</span><span class="n">save_screenshot</span><span class="p">(</span><span class="s1">'/tmp/selenium.png'</span><span class="p">)</span>
<span class="n">driver</span><span class="o">.</span><span class="n">quit</span><span class="p">()</span>
</code></pre></div>
<p>After confirming that it indeed opens required page and saves screenshot, I added two command line flags:</p>
<div class="highlight"><pre><span></span><code><span class="n">chrome_options</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--headless'</span><span class="p">)</span>
<span class="n">chrome_options</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--window-size=1920,2160'</span><span class="p">)</span>
</code></pre></div>
<p>This way the browser opened by the script is not visible on screen, so I can use the computer without risk of interfering with the automation. I increased the window height to capture the entire page content in a single screenshot.</p>
<h2 id="visual-difference-between-two-images"><a class="toclink" href="#visual-difference-between-two-images">Visual difference between two images</a></h2>
<p>Thanks to the <a href="https://imagemagick.org">ImageMagick</a> library and its set of tools, a visual difference between two images can be produced with a single command:</p>
<div class="highlight"><pre><span></span><code>compare<span class="w"> </span>-compose<span class="w"> </span>src<span class="w"> </span>FIRST_FILE<span class="w"> </span>SECOND_FILE<span class="w"> </span>OUTPUT_FILE
</code></pre></div>
<p>I ran my script twice and saved the page screenshots as two distinct files. After feeding them to the above command, I obtained this (click to see full size):</p>
<figure>
<a href="https://mirekdlugosz.com/blog/2019/simple-visual-regression-checking-with-selenium-and-imagemagick/simple-visual-regression-checking-with-selenium-and-imagemagick/sample-difference.png">
<img src="https://mirekdlugosz.com/blog/2019/simple-visual-regression-checking-with-selenium-and-imagemagick/simple-visual-regression-checking-with-selenium-and-imagemagick/sample-difference-min.png" title="Sample visual difference between two teams" alt="Sample visual difference between two teams" loading="lazy">
</a>
<figcaption>Sample visual difference between two teams</figcaption>
</figure>
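<p>When the comparison has to run for many screenshot pairs, the command can be launched from Python. The following is a minimal sketch, assuming the ImageMagick <code>compare</code> executable is on <code>PATH</code>; the helper names and the file names are hypothetical. ImageMagick documents exit status 2 as an error, which is the only status the sketch treats as a failure.</p>

```python
import shutil
import subprocess


def compare_command(first, second, output):
    """Build the ImageMagick command line shown above as an argument list."""
    return ["compare", "-compose", "src", str(first), str(second), str(output)]


def write_difference(first, second, output):
    """Run ImageMagick compare and return its exit status (0 or 1)."""
    if shutil.which("compare") is None:
        raise RuntimeError("ImageMagick 'compare' was not found on PATH")
    completed = subprocess.run(compare_command(first, second, output))
    # Exit status 2 signals an error, e.g. an unreadable input file
    if completed.returncode >= 2:
        raise RuntimeError(f"compare failed with exit status {completed.returncode}")
    return completed.returncode


# Hypothetical file names, only to show the resulting command line
command = compare_command("before.png", "after.png", "difference.png")
print(" ".join(command))
```

<p>Passing the arguments as a list avoids shell quoting problems when file names contain spaces.</p>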
<h2 id="creating-safe-filenames"><a class="toclink" href="#creating-safe-filenames">Creating safe filenames</a></h2>
<p>I want to be able to trace an image with differences back to the <span class="caps">URL</span> that triggered them, in case I need to analyse them in closer detail.</p>
<p>Using the <span class="caps">URL</span> as the image name seems natural. Unfortunately, a full team definition can be quite lengthy (the longest <span class="caps">URL</span> in my sample is 528 characters long), and the ext4 file system limits file name length to 255 bytes. This is often not enough.</p>
<p>To ensure uniqueness of the file name while keeping its length limited, I decided to use a hash (checksum) of the <span class="caps">URL</span> string as the file name. To meet the traceability requirement, I stored each hash together with its <span class="caps">URL</span> in a separate file.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="k">def</span> <span class="nf">fs_sanitize</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
    <span class="n">hash_</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">string</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span>
    <span class="k">return</span> <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">hash_</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span><span class="si">}</span><span class="s2">.png"</span>
<span class="n">map_handle</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'map.txt'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span>
<span class="n">team</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">teams</span><span class="p">)</span>
<span class="n">fs_friendly_url</span> <span class="o">=</span> <span class="n">fs_sanitize</span><span class="p">(</span><span class="n">team</span><span class="p">)</span>
<span class="n">map_handle</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">fs_friendly_url</span><span class="si">}</span><span class="se">\t</span><span class="si">{</span><span class="n">team</span><span class="si">}</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="o">.</span><span class="n">Chrome</span><span class="p">(</span><span class="n">options</span><span class="o">=</span><span class="n">chrome_options</span><span class="p">)</span>
<span class="n">base_url</span> <span class="o">=</span> <span class="s1">'http://localhost:4200'</span>
<span class="n">driver</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">base_url</span><span class="si">}{</span><span class="n">team</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">driver</span><span class="o">.</span><span class="n">save_screenshot</span><span class="p">(</span><span class="n">fs_friendly_url</span><span class="p">)</span>
<span class="n">driver</span><span class="o">.</span><span class="n">quit</span><span class="p">()</span>
<span class="n">map_handle</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
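<p>The mapping file is only useful if it can be read back. The following is a minimal sketch of that lookup, assuming the tab-separated format written above. <code>load_url_map</code> and the example team URLs are hypothetical names introduced here for illustration; they are not part of the original script.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def fs_sanitize(string):
    # Same scheme as above: SHA-256 hex digest plus the .png extension.
    hash_ = hashlib.sha256(string.encode('utf-8'))
    return f"{hash_.hexdigest()}.png"


def load_url_map(map_path):
    """Return a dict that maps screenshot file names to the URLs they came from."""
    mapping = {}
    with open(map_path, encoding='utf-8') as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line:
                continue
            # Each line is "<file name>\t<URL>"; split only on the first tab.
            file_name, url = line.split('\t', 1)
            mapping[file_name] = url
    return mapping


# Hypothetical team URLs, used only to demonstrate the round trip.
example_teams = ['/team/' + 'a' * 500, '/team/short']

with tempfile.TemporaryDirectory() as tmp_dir:
    map_path = Path(tmp_dir) / 'map.txt'
    with open(map_path, 'w', encoding='utf-8') as handle:
        for team in example_teams:
            handle.write(f"{fs_sanitize(team)}\t{team}\n")
    url_map = load_url_map(map_path)

# Every generated name has the same length, no matter how long the URL is.
name_lengths = {len(name) for name in url_map}
```

<p>Because the name is always 64 hexadecimal characters plus the extension, it stays well below the 255-byte limit even for the longest team definition.</p>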
<h2 id="optimizations"><a class="toclink" href="#optimizations">Optimizations</a></h2>
<p>Google Analytics stored some 400 000 unique URLs. This is way too many to check during a weekend project, not to mention that they can be downloaded only in batches of 5000.</p>
<p>So the first optimization is downloading only a subset of them. I opted for 10 000. Given that from the 1600th item onwards each <span class="caps">URL</span> was accessed fewer than 10 times, this is essentially an exhaustive list of “popular” URLs and some random sample of less-popular URLs.</p>
<p>But 10 000 is still too many. Even assuming it took only 3 seconds to process one team, it would still take a good 8 hours to process all of them. I further reduced the size of that list by drawing a random sample from it.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">random</span>
<span class="n">teams_subset</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">teams</span><span class="p">,</span> <span class="mi">400</span><span class="p">)</span>
</code></pre></div>
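<p>The arithmetic behind that decision can be checked in a few lines. This is a rough sketch that assumes strictly sequential processing and the 3 seconds per team mentioned above; <code>estimated_hours</code> is a hypothetical helper, not part of the original script.</p>

```python
def estimated_hours(url_count, seconds_per_url):
    """Rough sequential run time in hours for a given number of URLs."""
    return url_count * seconds_per_url / 3600


full_list_hours = estimated_hours(10_000, 3)  # the 10 000 downloaded URLs
sample_hours = estimated_hours(400, 3)        # the random sample of 400

print(f"full list: {full_list_hours:.1f} h, sample: {sample_hours * 60:.0f} min")
```

<p>Under these assumptions the full list needs a bit over 8 hours, while a sample of 400 finishes in about 20 minutes.</p>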
<p>Initially, I aimed for code simplicity. Since I needed two screenshots to compare, it seemed obvious that I should use a loop.</p>
<p>Then I realized that I was basically doubling the execution time for no good reason. Instead, I should start two web drivers at once, ask each to open a different page, wait a little and then obtain both screenshots, even if that means some duplicated code.</p>
<div class="highlight"><pre><span></span><code><span class="n">manager</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s2">"actual"</span><span class="p">:</span> <span class="p">{</span>
        <span class="s2">"driver"</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span>
        <span class="s2">"dir"</span><span class="p">:</span> <span class="n">pathlib</span><span class="o">.</span><span class="n">Path</span><span class="p">(</span><span class="s1">'actual_results/'</span><span class="p">),</span>
        <span class="s2">"base_url"</span><span class="p">:</span> <span class="s1">'http://localhost:4200'</span>
    <span class="p">},</span>
    <span class="s2">"expected"</span><span class="p">:</span> <span class="p">{</span>
        <span class="s2">"driver"</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span>
        <span class="s2">"dir"</span><span class="p">:</span> <span class="n">pathlib</span><span class="o">.</span><span class="n">Path</span><span class="p">(</span><span class="s1">'expected_results/'</span><span class="p">),</span>
        <span class="s2">"base_url"</span><span class="p">:</span> <span class="s1">'https://createpokemon.team'</span>
    <span class="p">}</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="n">manager</span><span class="p">:</span>
    <span class="n">manager</span><span class="p">[</span><span class="n">run</span><span class="p">][</span><span class="s2">"driver"</span><span class="p">]</span> <span class="o">=</span> <span class="n">webdriver</span><span class="o">.</span><span class="n">Chrome</span><span class="p">()</span>
<span class="k">for</span> <span class="n">team</span> <span class="ow">in</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">teams</span><span class="p">,</span> <span class="mi">400</span><span class="p">):</span>
    <span class="n">fs_friendly_url</span> <span class="o">=</span> <span class="n">fs_sanitize</span><span class="p">(</span><span class="n">team</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="n">manager</span><span class="o">.</span><span class="n">values</span><span class="p">():</span>
        <span class="n">base_url</span> <span class="o">=</span> <span class="n">run</span><span class="p">[</span><span class="s2">"base_url"</span><span class="p">]</span>
        <span class="n">run</span><span class="p">[</span><span class="s2">"driver"</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">base_url</span><span class="si">}{</span><span class="n">team</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="n">manager</span><span class="o">.</span><span class="n">values</span><span class="p">():</span>
        <span class="n">screenshot_path</span> <span class="o">=</span> <span class="n">run</span><span class="p">[</span><span class="s2">"dir"</span><span class="p">]</span><span class="o">.</span><span class="n">joinpath</span><span class="p">(</span><span class="n">fs_friendly_url</span><span class="p">)</span>
        <span class="n">run</span><span class="p">[</span><span class="s2">"driver"</span><span class="p">]</span><span class="o">.</span><span class="n">save_screenshot</span><span class="p">(</span><span class="n">screenshot_path</span><span class="o">.</span><span class="n">as_posix</span><span class="p">())</span>
<span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="n">manager</span><span class="o">.</span><span class="n">values</span><span class="p">():</span>
    <span class="n">run</span><span class="p">[</span><span class="s2">"driver"</span><span class="p">]</span><span class="o">.</span><span class="n">quit</span><span class="p">()</span>
</code></pre></div>
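<p>The loop above keeps both browsers in lockstep. One way the same idea might scale to more browsers is a shared queue of URLs that idle workers pull from. This is only a sketch under stated assumptions, not the code I ran: <code>run_workers</code> and <code>fake_screenshot</code> are hypothetical names, and the stand-in function replaces the real "open the page and save a screenshot" step so the example runs without a browser.</p>

```python
import queue
import threading


def run_workers(urls, process, worker_count=4):
    """Process every URL exactly once using a fixed pool of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        # In a real run, each worker would own one web driver for its lifetime.
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do, the worker is finished
            outcome = process(url)
            with results_lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_screenshot(url):
    # Hypothetical stand-in for "open the page and save a screenshot".
    return f"screenshot of {url}"


done = run_workers([f"/team/{n}" for n in range(20)], fake_screenshot)
```

<p>A worker that finishes early simply takes the next URL from the queue, so no browser sits idle while others are still busy.</p>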
<h2 id="results-analysis"><a class="toclink" href="#results-analysis">Results analysis</a></h2>
<p>I started by sorting all created images by size. This allowed me to quickly identify outliers:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>ls<span class="w"> </span>-lahSr<span class="w"> </span>diff/
...
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span><span class="m">5</span>,6K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">14</span>:43<span class="w"> </span>67896595bd945c62fdb8c857afb6887baf50e1fb62904e9e7159fc034e7f0912.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span><span class="m">5</span>,6K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">14</span>:36<span class="w"> </span>0487b61857b7417920d0cb3a70641e74d563e417f0354c94a9f66b292a10686e.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span><span class="m">5</span>,7K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">14</span>:43<span class="w"> </span>869fce86191cf921fe253d1f1c792280b0c01d481a35b0da3d10ebe5b27824a6.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span><span class="m">5</span>,7K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">14</span>:30<span class="w"> </span>5e2a96809439a5bac1d235b16544c2385532d1a1ad379abb1586256540d75140.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span><span class="m">5</span>,9K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">15</span>:14<span class="w"> </span>9aa5ca526685da394c0cf401aa44596657298f19ee347b3f880c3f48e25b76a8.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span><span class="m">6</span>,2K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">14</span>:01<span class="w"> </span>7ce7e7f3d05560e26981e6b9c23773a0372f6cf6f1bc21c0ed6a0f8d4da61447.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span>20K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">13</span>:52<span class="w"> </span>862eceaaf4bcb06ffa0fdaf6b263999d1d5e2ec06b1f9d40533c311b9d89bef5.png
-rw-r--r--<span class="w"> </span><span class="m">1</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span>27K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">14</span>:51<span class="w"> </span>0133026b0a0f64ca7cb00529d083ace2effd89fb4d1279283ca6e6d8087cc35e.png
drwxr-xr-x<span class="w"> </span><span class="m">2</span><span class="w"> </span>mdlugosz<span class="w"> </span>mdlugosz<span class="w"> </span>72K<span class="w"> </span>Nov<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="m">15</span>:29<span class="w"> </span>.
</code></pre></div>
<p>It turned out there are some cases where the same team does not produce identical-looking pages, but not for the reason I was interested in. Some Pokemon changed their displayed name slightly, and sometimes the new name takes a different number of rows than the old one. As a result, a considerable part of the page got moved vertically, causing a big diff.</p>
<p>Another problem is that during development, the new version uses a different domain than the existing instance, and the current <span class="caps">URL</span> is displayed near the bottom of the page. This caused all pairs to report some differences. I skimmed over all images to confirm there are no unexpected changes, but I should strive to make the images really identical. This would allow me to exclude all images with the exact same size from analysis, making it trivial to identify cases that differed in a significant way.</p>
<figure>
<a href="https://mirekdlugosz.com/blog/2019/simple-visual-regression-checking-with-selenium-and-imagemagick/simple-visual-regression-checking-with-selenium-and-imagemagick/random-result.png">
<img src="https://mirekdlugosz.com/blog/2019/simple-visual-regression-checking-with-selenium-and-imagemagick/simple-visual-regression-checking-with-selenium-and-imagemagick/random-result-min.png" title="Random 'nothing interesting to see here, move along' image" alt="Random 'nothing interesting to see here, move along' image" loading="lazy">
</a>
<figcaption>Random ‘nothing interesting to see here, move along’ image</figcaption>
</figure>
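<p>The size-based triage described above could also be scripted. This is a minimal sketch that assumes the most common file size belongs to the uninteresting diffs, which only holds once those diffs really are identical. <code>triage_by_size</code> is a hypothetical helper, and the demonstration uses made-up files rather than real diff images.</p>

```python
import tempfile
from collections import Counter
from pathlib import Path


def triage_by_size(diff_dir):
    """Split diff images into the common 'baseline' size and everything else."""
    sizes = {path: path.stat().st_size for path in Path(diff_dir).glob('*.png')}
    if not sizes:
        return None, []
    # Assumption: the most frequent file size belongs to uninteresting diffs.
    baseline_size, _ = Counter(sizes.values()).most_common(1)[0]
    outliers = [path for path, size in sizes.items() if size != baseline_size]
    # Largest files first, since big diffs tend to mean big visual changes.
    outliers.sort(key=lambda path: sizes[path], reverse=True)
    return baseline_size, outliers


# Self-contained demonstration on made-up files of known sizes.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, size in [('a.png', 100), ('b.png', 100), ('c.png', 100),
                       ('d.png', 250), ('e.png', 900)]:
        (Path(tmp_dir) / name).write_bytes(b'x' * size)
    baseline, outliers = triage_by_size(tmp_dir)
    outlier_names = [path.name for path in outliers]
```

<p>Combined with the mapping file, each outlier name can then be traced back to the team <span class="caps">URL</span> that produced it.</p>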
<h2 id="conclusion-and-ideas-for-further-work"><a class="toclink" href="#conclusion-and-ideas-for-further-work">Conclusion and ideas for further work</a></h2>
<p><a href="https://github.com/mirekdlugosz/scrapbook/tree/master/create-pokemon-team-visual-diff">Final version of code I have used is on GitHub</a>.</p>
<p>While this solution did get the work done, it is not perfect. There are a number of things that could be done to improve performance and maintainability:</p>
<ul>
<li>Proper logging and exception handling should be added.</li>
<li>Paths and parameters (like sample size) should be passed in as command line options, or loaded from environment.</li>
<li>Screenshots of one team should be bit-for-bit identical to allow easier results analysis. This could be achieved by adjusting the browser window size or by changing the development version to produce the exact same <span class="caps">URL</span> as the production instance.</li>
<li>Two webdriver instances are very far from fully utilizing available system resources. The main loop should be revamped to support a larger number of concurrent web driver sessions. One way to achieve that is a queueing mechanism, which would store the list of URLs to process and assign them to web drivers that are free (web drivers would need to report that they completed assigned work and can take up another task).</li>
<li>Fixed wait times are widely considered a code smell in web automation. Ideally, webdriver should take the screenshot as soon as the page has fully loaded team data.</li>
<li>Image diffs should be created in a separate process. This would make it possible to fully utilize multiple CPUs on the machine, but requires implementing another queueing mechanism (as well as an efficient way to find pairs of images that were not yet processed).</li>
</ul>How to win 3rd place at TestingCup?2019-07-07T11:28:32+02:002019-07-07T11:28:32+02:00Mirek Długosztag:mirekdlugosz.com,2019-07-07:/blog/2019/how-to-win-3rd-place-at-testingcup/<p><a href="http://testingcup.pl/">TestingCup</a> is annual testing competition in Poland and this year, I won 3rd place in individual category.</p>
<p><a href="http://testingcup.pl/">TestingCup</a> is an annual testing competition in Poland, and this year I won 3rd place in the individual category.</p>
<h2 id="context-competition-rules"><a class="toclink" href="#context-competition-rules">Context: Competition rules</a></h2>
<p>During TestingCup, we are given three hours to test an application that we see for the first time and which is crafted specifically for the competition. We earn points, and the winner is the person who collects the most of them.</p>
<p>Points are earned in two ways: by reporting bugs and by creating a testing process artifact. The number of points for a bug report depends on severity - critical security issues and application crashes are worth the most, duplicates are worth the least (actually, they are worth negative points). It pays off to go really deep and find important problems, but it also pays off to maximize coverage and find many issues. The testing process artifact is a single document that must comply with a “widely-used standard”, such as <span class="caps">IEEE</span>-829. It is graded against an unknown checklist of elements it should contain - each checked box earns you some points. As far as I can tell, the actual substance of the document is of lesser importance.</p>
<p>Points are awarded by the championships jury in a non-transparent process. After the championships you are given your total number of points, but you don’t know how many points you earned for each activity, what the final severity of your bug reports was, or which boxes on the artifact checklist were checked. I guess you can ask over email? Jury decisions are final and there is no appeal process in place. The jury promises that each bug report goes through at least two jury members and that they discuss until disagreements are resolved.</p>
<p>You can download the application and accompanying documents, including a list of known bugs and example artifacts, from the <a href="http://www.mrbuggy.pl/">MrBuggy website</a>.</p>
<h2 id="prepare-your-machine"><a class="toclink" href="#prepare-your-machine">Prepare your machine</a></h2>
<p>During the competition, you are expected to use your own machine. Organizers provide some minimal requirements it must meet (native Windows installation, particular .<span class="caps">NET</span> version or newer, <span class="caps">RJ</span>-45 connection, sometimes others) and a list of forbidden activities (mainly communicating with external parties and decompiling). Everything in between is fair play, which places preparation of your machine among the most important things you can do to maximize your chances of winning.</p>
<p>Install every development tool and productivity software you know how to use, and also some that you have only heard about. Last year, at one point I discovered that the application stores data in an SQLite database, but I didn’t have tools to access it and poke around. This year, I installed git for Windows, Python, R with RStudio and tidyverse, Postman, LibreOffice suite, Greenshot, SQLiteBrowser, 7-zip and <span class="caps">VS</span> Code (including plugins for spell checking, linting and indentation). And probably some more. Even then, during the competition there was a moment when I wished I had Jupyter Notebook installed.</p>
<p>Keep reference materials on your disk. When working on the test process artifact, you might want to open the <span class="caps">ISTQB</span> syllabus and ensure you haven’t missed something obvious. Last year, I did not have any testing resources on my machine and I am sure my test report wasn’t particularly good. This year, I had the <span class="caps">ISTQB</span> syllabus, offline copies of <a href="https://www.developsense.com/">Michael Bolton</a>’s and <a href="https://www.satisfice.com/">James Bach</a>’s blogs and some other documents. Our task was to create a test plan and I kept my cool just because I had access to an article titled <a href="https://www.developsense.com/blog/2008/12/what-should-test-plan-contain/"><em>What Should A Test Plan Contain?</em></a>.</p>
<p>This kind of feels like cheating, but you might prepare templates for critical test process artifacts. Championships rules do not forbid it. That was my idea for this year - I copied one of the test reports from previous competitions and intended to use it as a template. I did not use it in the end, as this year we had to create a test plan.</p>
<h2 id="read-instructions"><a class="toclink" href="#read-instructions">Read instructions</a></h2>
<p>I know this one is mentioned in virtually every “how to pass FooBar exam/certification” article, but I underestimated how important it really is.</p>
<p>This year we had the opportunity to evaluate our own reports and judge (anonymized) work of others after the competition. And clearly, some people did not read the instructions, or failed to understand them. I saw bug reports for things that were explicitly included in the list of “known issues”. I saw bug reports pointing out that features described in the Change Request document are missing - you know, features that were requested by business, for which development has not yet started. I also saw a test plan that was literally perfect, except for one small detail - it was completely off-topic, being based on the delivered MrBuggy instead of the Change Request document. I don’t know how many points this person earned, because “document is on-topic” was not on the list of things we were supposed to check.</p>
<p>I don’t want to bash these people or paint myself as superior. I want to stress that you should read all of the provided materials, especially the instructions. And then you should read them again. And then you should read them from bottom to top, just to ensure you really understand what is expected from you, what will be held against you and what doesn’t matter at all.</p>
<h2 id="keep-it-simple"><a class="toclink" href="#keep-it-simple">Keep it simple</a></h2>
<p>You know how articles <a href="https://www.guru99.com/defect-management-process.html#2">introducing various test process artifacts list all kinds of stuff as required</a>? Following their advice is a sure way to waste time and focus on the least important tasks.</p>
<p>During the competition, your bug reports must cover real problems and be understood by the jury. There are no other requirements. Usually it’s a good idea to provide steps to reproduce, but sometimes they are so short that you may skip them. There are situations when it’s required to point out what you expected to happen, but often this is obvious from context. You might describe the testing environment in painstaking detail, but everyone has exactly the same one, so why bother?</p>
<p>Same goes for the way you write your reports. Sure, you might show off your language proficiency, but is it worth spending 30 seconds looking for the exact word that perfectly conveys what you mean? Someone else used a simpler word and spent those 30 seconds thinking how to test a specific requirement.</p>
<p>Simply put, don’t waste time on information that is not required or necessary. Use simple words and simple grammar. Keep your sentences short and on point. Focus on discovering important problems fast and make sure they are communicated clearly.</p>
<h2 id="track-your-time"><a class="toclink" href="#track-your-time">Track your time</a></h2>
<p>It’s pretty obvious, but important enough to state it explicitly. Keep track of time.</p>
<p>It’s very easy to forget about the passage of time when you face a serious and interesting challenge, or when you are extremely focused on the task at hand. Yet the competition does not provide the luxury of spending as much time as you want on everything that piques your interest. You have to consciously control the amount of time spent on each activity and feature. Concentrating on one thing only for a full hour is not worth it.</p>
<p>This also means you have to be relentless in deciding it’s time to move on. Sure, you might feel you are so close to a revelation and be tempted to give it one more minute, but what you probably really feel is the sunk cost fallacy. Leaving unfinished work is hard, but necessary. It might help to make a note so you can return to this problem later on.</p>
<h2 id="abuse-notes"><a class="toclink" href="#abuse-notes">Abuse notes</a></h2>
<p>This is another rather obvious, but nevertheless important point. You are working on a computer, which is able to store a virtually unlimited amount of text. As part of the conference pack you will be given a pen and a notebook. Make use of them.</p>
<p>This year, for the first time, an <span class="caps">HTTP</span> <span class="caps">API</span> was a supported way of interacting with MrBuggy. All calls required an Authorization header, which had to include a base64-encoded username and password. While the organizers did provide a simple tool to encode one string, re-typing usernames and copying them all the time would have been a huge waste of time. Instead, I kept the encoded strings in <span class="caps">VS</span> Code. This way I could quickly select and copy them.</p>
<p>Last year, I added an item to my notebook after covering each feature. This helped to direct further efforts into areas that were not yet tested, as well as provided an overview of what I actually did. I also captured ideas that I would like to pursue further if time permits. This made it easier to leave some tasks unfinished when time for them was running out.</p>
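<p>With Python installed on the machine, the encoded strings for all accounts could be prepared in one go. The sketch below assumes the header followed the standard <span class="caps">HTTP</span> Basic scheme (<code>username:password</code>, base64-encoded, prefixed with <code>Basic</code>); the exact format MrBuggy expected may have differed. The account names and passwords are made up, and <code>basic_auth_value</code> is a hypothetical helper.</p>

```python
import base64


def basic_auth_value(username, password):
    """Build an Authorization header value, assuming the HTTP Basic scheme."""
    credentials = f"{username}:{password}".encode('utf-8')
    return "Basic " + base64.b64encode(credentials).decode('ascii')


# Hypothetical test accounts; the real ones came from the competition materials.
accounts = {'admin': 'admin123', 'tester': 'secret'}
auth_headers = {user: basic_auth_value(user, pwd) for user, pwd in accounts.items()}

# Print a tab-separated cheat sheet that can be pasted into notes.
for user, value in auth_headers.items():
    print(f"{user}\t{value}")
```

<p>The printed list is exactly the kind of thing worth keeping in a notes file for the rest of the session.</p>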
<p>As you can see, notes don’t have to be used in creative way to be useful. Just keep in mind they are an option and use them every time they can support your main activities.</p>
<h2 id="dont-bother-with-live-results"><a class="toclink" href="#dont-bother-with-live-results">Don’t bother with live results</a></h2>
<p>Preliminary results are displayed live during the competition. You will do best if you ignore them completely.</p>
<p>Last year, my name was third on the early results table (at least the last time I saw it before the competition ended). I felt pretty good about it and I thought I could actually get the trophy, so you can imagine my disappointment when the final results were announced and I finished seventh. This year, I fell out of the top 10 around midway through the competition. Last time I saw my name, I had around 40 points. Near the end of the competition, everyone had 50-70 points. That’s a pretty big gap and I was sure there was no way for me to close it, so I accepted I would finish in a worse place than the previous year. You can imagine my surprise when I was announced as the winner of 3rd place.</p>
<p>Live results are misleading in part because they don’t factor in the test process artifact. It’s worth 20 or so points, so it can impact your results quite a bit.</p>
<p>But, much more importantly, live results are based entirely on self-assigned categories. If you decide your bug is a critical security issue, your total points will increase by 10. Later the jury might decide this bug should have much lower severity and your final points will go down considerably.</p>
<p>You can easily secure top spot in live results - just report all your bugs as most critical. Live results are as easy to game as they are meaningless.</p>
<h2 id="practice-at-home"><a class="toclink" href="#practice-at-home">Practice at home?</a></h2>
<p>I have not followed this one myself, so I can’t say how important it actually is. Nevertheless, the <a href="http://mrbuggy.pl/">MrBuggy website</a> provides software used in previous editions of the championships, along with a list of known issues and example test process artifacts. You can download it, set a timer for three hours and do a dry run of the competition. Just write down all bug reports and the document in some local file. Afterwards, compare the list of bugs you found with the list of all known bugs. Which did you fail to find? Why? What could you do differently to earn more points?</p>
<h2 id="make-it-fun"><a class="toclink" href="#make-it-fun">Make it fun</a></h2>
<p>Last, but not least, try to be positive towards the entire championships and just have fun.</p>
<p>Competitions are not an objective assessment of your skills, knowledge or worth as a tester. Neither are they a very reliable measurement tool. As an example, the same person won first place in 2017, second place in 2018 and… fourteenth place this year. Shuffles like that are quite common and have many, many reasons.</p>
<p>Personally, I didn’t prepare at all for my first championships in 2018. For 2019, I made a point of preparing my machine, but mostly relied on instinct and a natural approach to problems during the competition. Winning a trophy is nice, but it was never the goal for me - I mostly wanted to know how well I naturally stand against the others. As it turns out, pretty well.</p>
<p>As a closing remark: if you had fun during competition, if you learned a single lesson, if you improved your craft in any way - you are the true winner. It doesn’t matter if you were first or last in final standing.</p>Setting up Protractor on Debian GNU/Linux2017-04-18T12:45:12+02:002017-04-18T12:45:12+02:00Mirek Długosztag:mirekdlugosz.com,2017-04-18:/blog/2017/setting-up-protractor-on-debian-gnu-linux/<p><a href="http://www.protractortest.org/">Protractor</a> is test framework for web applications written on top of <a href="https://angular.io/">Angular</a>. Unfortunately, installing it on Debian is non-obvious, as it has not yet found its way into repository and existing documentation is catered to needs of Mac <span class="caps">OS</span> users. This guide will help you to get through this process without messing up your entire system.</p>
<p><a href="http://www.protractortest.org/">Protractor</a> is a test framework for web applications built with <a href="https://angular.io/">Angular</a>. Unfortunately, installing it on Debian is non-obvious, as it has not yet found its way into the repository and existing documentation caters to the needs of Mac <span class="caps">OS</span> users. This guide will help you get through this process without messing up your entire system.</p>
<h2 id="install-node"><a class="toclink" href="#install-node">Install Node</a></h2>
<p>Protractor runs on top of <a href="https://nodejs.org">Node</a>, which you must install before doing anything else:</p>
<div class="highlight"><pre><span></span><code># apt-get install nodejs nodejs-legacy
</code></pre></div>
<p>Contrary to its name, <code>nodejs-legacy</code> is not a legacy version of Node, but a compatibility layer that lets it use the <code>node</code> binary name. Node in Debian is invoked using the <code>nodejs</code> binary, because <a href="https://lists.debian.org/debian-devel-announce/2012/07/msg00002.html">another program in the repository already provided the <code>node</code> name</a>.
Since the entire Node ecosystem expects <code>node</code> to refer to the Node binary, installing the compatibility layer saves a lot of hassle; and since <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=797929">that other program has been removed</a>, there is no reason not to install it.</p>
<p>As a side note, Debian proper provides an ancient Node version that you shouldn’t bother with. The latest <span class="caps">LTS</span> version (as of the time of this writing) can be found in <a href="https://packages.debian.org/experimental/nodejs">experimental</a>. Use <a href="http://jaqque.sbih.org/kplug/apt-pinning.html">apt-pinning</a> to get it.</p>
<p>While we are installing packages from the repository, you might also consider installing <code>jq</code> and <code>default-jre</code>. <code>jq</code> is a nice little shell utility to retrieve data from <span class="caps">JSON</span> files; we will use it in the next step when downloading npm. <code>default-jre</code> pulls in <code>openjdk-8-jre</code> (Java 8), which is required to run the Selenium standalone server. Protractor can connect to browsers directly, so running Selenium standalone is not mandatory, but it seems to be preferred by the majority of the community. You might as well pull it in now.</p>
<h2 id="install-and-configure-npm"><a class="toclink" href="#install-and-configure-npm">Install and configure npm</a></h2>
<p>Protractor can be installed using npm, the default package manager for the Node development environment. Upstream distributes npm with Node itself, but Debian decided to decouple these packages. The problem is, Debian’s npm package fell from grace many years ago and is so outdated that it outright fails to install some packages. At the time of this writing, it was even <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857986">considered for removal</a>. Luckily, npm can be installed or updated separately from Node. By using npm. That leaves us in a funny place, where we need npm, but we need npm to install npm.</p>
<p>This can be remedied by downloading npm manually and running this version to install npm.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">cd</span><span class="w"> </span>/tmp/
$<span class="w"> </span>wget<span class="w"> </span><span class="s1">'https://registry.npmjs.org/npm/'</span><span class="w"> </span>-O<span class="w"> </span>registry.json
$<span class="w"> </span>wget<span class="w"> </span><span class="s2">"https://registry.npmjs.org/npm/-/npm-</span><span class="k">$(</span>jq<span class="w"> </span>-r<span class="w"> </span><span class="s1">'."dist-tags".latest'</span><span class="w"> </span>registry.json<span class="k">)</span><span class="s2">.tgz"</span>
$<span class="w"> </span>tar<span class="w"> </span>xf<span class="w"> </span>/tmp/npm-*.tgz
</code></pre></div>
<p>At this point we have npm ready to use in <code>/tmp/package/bin/npm-cli.js</code>, but before we actually run it, we should consider one quirk of this package manager. Namely, it installs everything in the <code>node_modules</code> subdirectory of the <strong>current working directory</strong>. This makes it easy to create a semi-virtual environment with all packages needed for the project you are working on, but it also makes it easy to install binaries in random places in the directory tree, never to find them again.</p>
<p>npm also supports installation of global packages, but they are system-wide and go to <code>/usr/</code>.</p>
<p>To solve these issues, we will configure npm to treat global packages as available to the current user only. We will need to repeat this setup for each user on the system, but it is a small price to pay for centralized storage of node modules accessible from anywhere without messing with file system permissions.</p>
<p>Start by creating the <code>~/.npm-global</code> directory. The directory structure inside closely resembles <code>~/.local</code>, which would be more appropriate from the <span class="caps">XDG</span> point of view, but I do enjoy the ability to <code>rm -rf</code> one directory in order to nuke the entire thing.</p>
<p>Then, add the <code>~/.npm-global/bin/</code> directory to your <code>$PATH</code>. For the current shell session, this can be done with the command below. For a persistent change, you should modify the <code>~/.profile</code> or <code>~/.bashrc</code> file. Depending on your system setup, you might need to log out and log back in to see the change in new terminal emulator windows.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">export</span><span class="w"> </span><span class="nv">PATH</span><span class="o">=</span>~/.npm-global/bin:<span class="nv">$PATH</span>
</code></pre></div>
<p>Now you can set up npm to use the new directory and install npm properly. The commands below assume you are still in <code>/tmp</code>, where the archive was extracted:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>./package/bin/npm-cli.js<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>prefix<span class="w"> </span><span class="s1">'~/.npm-global'</span>
$<span class="w"> </span>./package/bin/npm-cli.js<span class="w"> </span>install<span class="w"> </span>-g<span class="w"> </span>npm@latest
</code></pre></div>
<h2 id="install-protractor"><a class="toclink" href="#install-protractor">Install Protractor</a></h2>
<p>After all these changes, you can finally install Protractor:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>npm<span class="w"> </span>install<span class="w"> </span>-g<span class="w"> </span>protractor
</code></pre></div>
<p>Protractor requires WebDriver browser drivers to be available. You can download and install them with the command below; I have not yet found a way to use the <code>chromiumdriver</code> package from the Debian repository.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>webdriver-manager<span class="w"> </span>update
</code></pre></div>
<p>As noted before, the majority of Protractor users seem to talk with browsers through the Selenium standalone server. If you are among them, make sure to run <code>webdriver-manager start</code> before running <code>protractor</code>. If you prefer Protractor to talk with browsers directly (this works only for chromedriver and geckodriver, i.e. Chrome, Chromium and Firefox), make sure that your <code>protractor.conf.js</code> file contains the following line:</p>
<div class="highlight"><pre><span></span><code><span class="n">directConnect</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span>
</code></pre></div>How to use R to recognize if given string is a word2016-02-28T19:01:56+01:002016-02-28T19:01:56+01:00Mirek Długosztag:mirekdlugosz.com,2016-02-28:/blog/2016/how-to-use-r-to-recognize-if-given-string-is-a-word/<p>StackOverflow user <em>seakyourpeak</em> <a href="http://stackoverflow.com/q/34514795/3552063">asked if R can be used to verify whether
given string is a word in English or not</a>. This is interesting problem that gives us opportunity to explore different kinds of correctness.</p>
<h2 id="correct-answer"><a class="toclink" href="#correct-answer">Correct answer</a></h2>
<p>This is not possible.</p>
<h2 id="longer-and-incorrect-answer"><a class="toclink" href="#longer-and-incorrect-answer">Longer and incorrect answer</a></h2>
<p>In case you don’t find this answer satisfying, you can resort to checking the string against a dictionary of known English words.
This approach is fundamentally broken, as we will explain later on, but it will work in the majority of cases.</p>
<p>First of all, we need a dictionary of “words” that we will match our data against. One such dictionary is compiled by Kevin Atkinson and distributed under an open source license at the <a href="http://wordlist.aspell.net/"><span class="caps">SCOWL</span> (And Friends)</a> website.</p>
<p>A local copy can be obtained by downloading the data file and unzipping it. The distribution package contains plenty of directories, but the <a href="http://wordlist.aspell.net/scowl-readme/"><span class="caps">README</span> file</a> says that we can ignore most of them. Our only point of interest is the directory called <code>final</code>, which contains the generated word lists.</p>
<p>Words are scattered across multiple files, as they are grouped by variant, category and size. Size can be understood as the inverse of commonness: the larger the number, the higher the rough probability that an everyday English user will <strong>not</strong> be familiar with a given word. This structure allows us to use the <code>list.files()</code> function with the <code>pattern</code> argument, or the <code>grepl()</code> function, to pinpoint the set of words that we deem correct.</p>
<p>The code below will set up the <code>words</code> vector with all common English words (size 60 and below) from scratch. In a real-life project you should probably use more persistent storage for the word database files.</p>
<div class="highlight"><pre><span></span><code><span class="n">dict_dir</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">tempdir</span><span class="p">()</span>
<span class="n">dict_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s">'http://downloads.sourceforge.net/wordlist/scowl-2016.01.19.zip'</span>
<span class="n">dict_local_zip</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">file.path</span><span class="p">(</span><span class="n">dict_dir</span><span class="p">,</span><span class="w"> </span><span class="nf">basename</span><span class="p">(</span><span class="n">dict_url</span><span class="p">))</span>
<span class="nf">if </span><span class="p">(</span><span class="o">!</span><span class="w"> </span><span class="nf">file.exists</span><span class="p">(</span><span class="n">dict_local_zip</span><span class="p">))</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nf">download.file</span><span class="p">(</span><span class="n">dict_url</span><span class="p">,</span><span class="w"> </span><span class="n">dict_local_zip</span><span class="p">)</span>
<span class="w"> </span><span class="nf">unzip</span><span class="p">(</span><span class="n">dict_local_zip</span><span class="p">,</span><span class="w"> </span><span class="n">exdir</span><span class="o">=</span><span class="n">dict_dir</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">dict_files</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list.files</span><span class="p">(</span><span class="nf">file.path</span><span class="p">(</span><span class="n">dict_dir</span><span class="p">,</span><span class="w"> </span><span class="s">'final'</span><span class="p">),</span><span class="w"> </span><span class="n">full.names</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span>
<span class="n">dict_files_match</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">tools</span><span class="o">::</span><span class="nf">file_ext</span><span class="p">(</span><span class="n">dict_files</span><span class="p">))</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">60</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="nf">grepl</span><span class="p">(</span><span class="s">"english-"</span><span class="p">,</span><span class="w"> </span><span class="n">dict_files</span><span class="p">,</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span>
<span class="n">dict_files</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dict_files</span><span class="p">[</span><span class="w"> </span><span class="n">dict_files_match</span><span class="w"> </span><span class="p">]</span>
<span class="n">words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">unlist</span><span class="p">(</span><span class="nf">sapply</span><span class="p">(</span><span class="n">dict_files</span><span class="p">,</span><span class="w"> </span><span class="n">readLines</span><span class="p">,</span><span class="w"> </span><span class="n">USE.NAMES</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">))</span>
<span class="nf">length</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>
</code></pre></div>
<!-- -->
<div class="highlight"><pre><span></span><code>## [1] 119050
</code></pre></div>
<p>Finally, we can verify if a string is an English word by checking if it exists in the vector of known words. The nice thing is that we get vectorization for free.</p>
<div class="highlight"><pre><span></span><code><span class="nf">c</span><span class="p">(</span><span class="s">"knight"</span><span class="p">,</span><span class="w"> </span><span class="s">"stack"</span><span class="p">,</span><span class="w"> </span><span class="s">"selfie"</span><span class="p">,</span><span class="w"> </span><span class="s">"l8er"</span><span class="p">,</span><span class="w"> </span><span class="s">"googling"</span><span class="p">,</span><span class="w"> </span><span class="s">"echinuliform"</span><span class="p">)</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">words</span>
</code></pre></div>
<!-- -->
<div class="highlight"><pre><span></span><code>## [1] TRUE TRUE TRUE FALSE TRUE FALSE
</code></pre></div>
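<p>One caveat worth keeping in mind: <code>%in%</code> compares strings exactly, so capitalization matters. Below is a minimal sketch of a case-insensitive check. The <code>is_word</code> helper and the toy dictionary are made up for illustration; in practice you would pass the <code>words</code> vector instead. Note that lowercasing both sides also erases the difference between proper names and common words, which may or may not be what you want.</p>

```r
# Toy stand-in for the `words` vector (an assumption made for this sketch)
toy_words <- c("knight", "stack", "selfie", "googling", "Poland")

# Hypothetical helper: membership test that ignores letter case
is_word <- function(strings, dictionary) {
  tolower(strings) %in% tolower(dictionary)
}

result <- is_word(c("Knight", "STACK", "l8er", "poland"), toy_words)
print(result)
## [1]  TRUE  TRUE FALSE  TRUE
```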
<p>Or, instead of downloading the archive and reading files, we could just use the <a href="https://cran.r-project.org/web/packages/qdapDictionaries/index.html"><code>qdapDictionaries</code></a> package and load the
<code>GradyAugmented</code> dataset. This approach was suggested by another StackOverflow user. Its main benefit is easy integration with the current R environment. <span class="caps">SCOWL</span>, however, offers more flexibility and, once the less common size categories are included, a larger dictionary.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="s">'qdapDictionaries'</span><span class="p">)</span>
<span class="nf">length</span><span class="p">(</span><span class="n">GradyAugmented</span><span class="p">)</span>
<span class="nf">c</span><span class="p">(</span><span class="s">"knight"</span><span class="p">,</span><span class="w"> </span><span class="s">"stack"</span><span class="p">,</span><span class="w"> </span><span class="s">"selfie"</span><span class="p">,</span><span class="w"> </span><span class="s">"l8er"</span><span class="p">,</span><span class="w"> </span><span class="s">"googling"</span><span class="p">,</span><span class="w"> </span><span class="s">"echinuliform"</span><span class="p">)</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">GradyAugmented</span>
</code></pre></div>
<!-- -->
<div class="highlight"><pre><span></span><code>## [1] 122806
## [1] TRUE TRUE FALSE FALSE FALSE FALSE
</code></pre></div>
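<p>To see at a glance where two dictionaries disagree, the results can be collected in one <code>data.frame</code>. This is only a sketch: <code>toy_scowl</code> and <code>toy_grady</code> are tiny hand-made stand-ins that mirror the outputs shown above, not the real word lists.</p>

```r
candidates <- c("knight", "stack", "selfie", "l8er", "googling", "echinuliform")

# Hand-made stand-ins (assumptions), built to reproduce the results shown above
toy_scowl <- c("knight", "stack", "selfie", "googling")
toy_grady <- c("knight", "stack")

comparison <- data.frame(
  string = candidates,
  scowl = candidates %in% toy_scowl,
  grady = candidates %in% toy_grady,
  stringsAsFactors = FALSE
)

# Keep only the strings on which the two dictionaries disagree
disagreements <- comparison[comparison$scowl != comparison$grady, ]
print(disagreements)
```

<p>With real dictionaries, swap the toy vectors for <code>words</code> and <code>GradyAugmented</code>.</p>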
<h2 id="long-explanation-of-correct-answer"><a class="toclink" href="#long-explanation-of-correct-answer">Long explanation of correct answer</a></h2>
<p>To answer the original question, we must establish some definition of “word”.</p>
<p>The naïve approach would say that a word is anything that can be found in a dictionary, but if we head in this direction, things will quickly go south.</p>
<p>The most obvious problem is that dictionaries are not created equal - some have more words than others. Admittedly, this is only a small obstacle that can be easily overcome by always using the most complete dictionary available.</p>
<p>But this dictionary will still suffer from the same limitation that all dictionaries do - it is inherently reactive. There is some timespan that passes - must pass - between people starting to use a word and a dictionary including it. The length of this timespan depends on how quickly dictionary creators notice that people use the word, decide that it is worth including and issue an updated version. In the past, when new editions had to be printed, the timespan was long enough that people could actually stop using the word <strong>before</strong> it was included.</p>
<p>Moreover, dictionary creators are people too and they might have a certain vision of their work’s purpose. In particular, they seem to be attracted to linguistic prescriptivism for some reason. This point of view greatly extends the inclusion timespan and might prevent some words from ever being included.</p>
<p>By solving these issues, we risk overshooting in the opposite direction - our dictionary might contain words that people never use and often don’t even understand. <a href="http://phrontistery.info/ihlstart.html">Dictionary of Unusual Words</a> is a website dedicated to collecting some of these words. In its vast repertory there are gems like “muscariform” and “suaveolent”.</p>
<p>Clearly, a better definition is needed.</p>
<p>Let’s say that a word is the smallest element of language that has a meaning (by the way, <a href="https://en.wikipedia.org/wiki/Word">Wikipedia says that a word is pretty much that</a>). This instantly solves the problems of missing fresh words and including ones that nobody understands. It also avoids the pitfall of appealing to social actors that might be biased; or does it?</p>
<p>If we agree on that definition and try to apply it, we will find ourselves asking “what does that string mean, if anything?”. It doesn’t take careful consideration to see that this is only an illusory solution - all we have done is shift attention from the definition of “word” to the definition of “meaning”.</p>
<p>In the <a href="http://existentialcomics.com/comic/90">spirit of late Wittgenstein</a>, we could say that the meaning of a word can only be understood by how the community of word-users participates in activities involving this word. While this definition has undeniable charm, it leads us back to social actors defining what is and what isn’t a word (by specifying whether something has a meaning or not). Except that this time it’s even worse.</p>
<p>For starters, the community of word-users might as well be an anonymous crowd with ambiguous boundaries. It’s not exactly an environment that promotes consensus.</p>
<p>Furthermore, the community of word-users is usually a subset of the community of interest. This is clearest in academia, where freshmen are interested in some field, but haven’t yet internalized the language of that field. But sometimes people use words outside of their original context and the community of word-users becomes orthogonal to the community of interest. This poses substantial practical challenges, as groups of interest are easier to reach than groups of word-users, but we risk reaching the wrong people.</p>
<p>Overall, the Wittgenstein-inspired definition leads to a rather uncomfortable situation where the same string is and isn’t a word, depending on which community we choose as a frame of reference. Basically all jargon and specialist terminology fall into that category, but slang-, dialect- and cant-specific words do as well. One of my favorite examples is <em>bootstrapping</em> - it is hard to comprehend for people without the proper background, it means different things in statistics and computer science, and it has <strong>a few different meanings</strong> in the latter.</p>
<p>Finally, thanks to the human mind’s astonishing ability to infer meaning from broadly defined context, there are odd cases when we are positive that something is a word, but we don’t know its meaning. <em>Jabberwocky</em> is full of those. But since it does follow grammar rules, we are still able to get a general idea of what it is about.</p>
<p>As you should have realized by now, the main issue in using a computer to verify if a given string is a word in English or not is not the computational complexity of the problem or limited resources, but coming up with a highly specific and sensitive algorithm. We intuitively know what words are and can recognize them among random characters, but coming up with a strict and precise definition is extremely hard.</p>
<p><strong>Takeaway message</strong>: there are different levels of correctness, in the same way that there is a difference between statistical and practical significance. Fundamentally or substantially incorrect solutions might actually solve all practical problems, so they might be good enough after all.</p>The map of bakeries that sell genuine St. Martin Croissants2015-11-11T20:51:25+01:002015-11-11T20:51:25+01:00Mirek Długosztag:mirekdlugosz.com,2015-11-11:/blog/2015/the-map-of-bakeries-that-sell-genuine-st-martin-croissants/<p>97 years ago Poland regained independence after being partitioned for well over a century. The date coincides with St. Martin Day, a holiday with pagan roots that somehow managed to be more important here in <a href="https://en.wikipedia.org/wiki/Pozna%C5%84">Poznań</a>. We celebrate by having a parade on one of the main streets and eating ungodly amounts of <em>rogal świętomarciński</em> (St. Martin Croissant), a local cake with <span class="caps">PGI</span> status in the European Union. In this blog post I will show how to plot locations of bakeries that are allowed to sell products with that name.</p>
<h2 id="background"><a class="toclink" href="#background">Background</a></h2>
<p>In 2008, <em>rogal świętomarciński</em> gained protected geographical indication (<span class="caps">PGI</span>) in the European Union. This means that all products sold under that name must meet certain criteria (composition, creation procedure, place of origin etc.). The local <em>Cech Cukierników i Piekarzy w Poznaniu</em> (Poznań Guild of Pastry Chefs and Bakers) verifies if cakes on the market do meet these criteria. They also manage <a href="http://cechcukiernikowipiekarzy.pl/lista-cukierni-z-certyfikatem.html">a list of bakeries allowed to sell products under the “St. Martin Croissant” name</a>. Some people say it protects customers, as it gives an objective way of ensuring that whatever they are buying does have a certain quality. Other people say it supports oligopoly and hinders competition, as one collective has the final word in saying who can and who can’t use the protected product name. Either way, it’s probably a good idea to know where genuine croissants are made - and this is what we will do.</p>
<h2 id="getting-the-data"><a class="toclink" href="#getting-the-data">Getting the data</a></h2>
<p>As in every analysis, the first step is obtaining the data. The list of bakeries on the Guild website provides names and addresses in convenient tabular form, which makes it a good starting point.</p>
<p>Thanks to the <code>rvest</code> package, that list can be downloaded, extracted and converted into a <code>data.frame</code> in just a few lines of code. Usually I prefer to perform web scraping tasks with the <code>xml2</code> package, because it allows for finer control, but that would be overkill in this case.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="s">'rvest'</span><span class="p">)</span>
<span class="n">page.url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s">'http://cechcukiernikowipiekarzy.pl/lista-cukierni-z-certyfikatem.html'</span>
<span class="n">page.content</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">read_html</span><span class="p">(</span><span class="n">page.url</span><span class="p">)</span>
<span class="n">bakeries</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">html_node</span><span class="p">(</span><span class="n">page.content</span><span class="p">,</span><span class="w"> </span><span class="s">'table'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span>
<span class="w"> </span><span class="nf">html_table</span><span class="p">(</span><span class="n">header</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span>
</code></pre></div>
<h3 id="cleaning-it-up"><a class="toclink" href="#cleaning-it-up">Cleaning it up</a></h3>
<p>Unfortunately, our <code>data.frame</code> is not exactly the same as the table on the website. Instead of 103 rows, we’ve got 113, and instead of 3 columns, we’ve got… 11?</p>
<p>That’s because the Guild website contains nested tables, and they cause a bit of trouble for <code>rvest</code>. When such a structure is supplied to the <code>html_table</code> function, it might not be able to return a <code>data.frame</code> with both correct dimensions and all the data. By default, preserving dimensions is deemed more important; users who prefer to retrieve as much data as possible may supply the <code>fill=TRUE</code> argument and deal with untidy data on their own.</p>
<p>Since we are walking down the second path, now it’s time to clean up the data. We could use some clever custom algorithm for that, but the dataset is rather small and only a few rows are wrong, so I guess that manual corrections are good enough.</p>
<div class="highlight"><pre><span></span><code><span class="n">wrong.rows</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">97</span><span class="p">,</span><span class="m">98</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="m">101</span><span class="p">,</span><span class="w"> </span><span class="m">103</span><span class="p">,</span><span class="m">104</span><span class="p">,</span><span class="w"> </span><span class="m">108</span><span class="o">:</span><span class="m">111</span><span class="p">)</span>
<span class="n">bakeries</span><span class="p">[</span><span class="n">wrong.rows</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bakeries</span><span class="p">[</span><span class="n">wrong.rows</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span>
<span class="n">bakeries</span><span class="p">[</span><span class="m">107</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bakeries</span><span class="p">[</span><span class="m">107</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">]</span>
<span class="n">bakeries</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bakeries</span><span class="p">[</span><span class="m">-1</span><span class="o">*</span><span class="n">wrong.rows</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">)]</span>
</code></pre></div>
<p>On a side note: if you happen to wonder why they decided to use nested tables in the most straightforward table ever, the answer is that they didn’t. But they did use Microsoft Office to generate <span class="caps">HTML</span>.</p>
<h3 id="adding-coordinates"><a class="toclink" href="#adding-coordinates">Adding coordinates</a></h3>
<p>Now that we have reproduced website’s table in R, it’s time to translate addresses into geographical coordinates.</p>
<p>This is made trivial by the <code>geocode</code> function in the <code>ggmap</code> package. To obtain a <code>data.frame</code> with longitude and latitude values, all we need to do is call <code>geocode(bakeries$`Miejsce produkcji`)</code>.</p>
<p>Of course <code>geocode</code> can’t get addresses’ coordinates out of thin air - it uses a web service for that. By default it queries
<a href="http://www.datasciencetoolkit.org/">Data Science Toolkit</a>, but the Google Maps <span class="caps">API</span> is supported as well.
There are many reasons to use Data Science Toolkit (including openness), and there are many reasons to avoid Google Maps (including privacy concerns). But they hardly matter when faced with the much higher quality of results that Google produces. In this example, Data Science Toolkit failed to get coordinates of seven addresses and missed another ten by some 7000 kilometers (4000 miles). On the other hand, the Google Maps <span class="caps">API</span> failed in just four cases - and they all share one root cause that can be corrected by small adjustments to the source data.</p>
<p>And since we will be using Google Maps in next step anyway, there are hardly any reasons to avoid Google Maps <span class="caps">API</span> right now.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="s">'ggmap'</span><span class="p">)</span>
<span class="n">bakeries</span><span class="p">[</span><span class="m">78</span><span class="p">,</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sub</span><span class="p">(</span><span class="s">"Wlkp."</span><span class="p">,</span><span class="w"> </span><span class="s">"Wielkopolska"</span><span class="p">,</span><span class="w"> </span><span class="n">bakeries</span><span class="p">[</span><span class="m">78</span><span class="p">,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span>
<span class="n">bakeries</span><span class="p">[,</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sub</span><span class="p">(</span><span class="s">"Wlkp."</span><span class="p">,</span><span class="w"> </span><span class="s">"Wielkopolski"</span><span class="p">,</span><span class="w"> </span><span class="n">bakeries</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span>
<span class="n">coordinates</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">geocode</span><span class="p">(</span><span class="n">bakeries</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">source</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"google"</span><span class="p">)</span>
<span class="n">bakeries</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">cbind</span><span class="p">(</span><span class="n">bakeries</span><span class="p">,</span><span class="w"> </span><span class="n">coordinates</span><span class="p">)</span>
</code></pre></div>
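<p>Whichever service is used, it pays to check what came back before plotting. The sketch below flags rows that were not geocoded at all and rows that landed far from where we expect them. It relies on the <code>lon</code> and <code>lat</code> columns that <code>geocode</code> returns; the toy data and the rough bounding box around Poland are my own assumptions, so adjust them to your data.</p>

```r
# Toy stand-in for the geocoded `bakeries` data frame (made-up values)
toy <- data.frame(
  name = c("Bakery A", "Bakery B", "Bakery C"),
  lon = c(16.93, NA, -97.5),
  lat = c(52.41, NA, 35.5),
  stringsAsFactors = FALSE
)

# Rows for which the service returned no coordinates
failed <- is.na(toy$lon) | is.na(toy$lat)

# Rough bounding box around Poland (assumed values, not an official boundary)
in_box <- !failed &
  toy$lon >= 14 & toy$lon <= 25 &
  toy$lat >= 49 & toy$lat <= 55

# Rows that were geocoded, but to a place outside the box
suspicious <- !failed & !in_box

print(toy[failed, ])
print(toy[suspicious, ])
```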
<h2 id="plotting-data"><a class="toclink" href="#plotting-data">Plotting data</a></h2>
<p>At this point we have everything that we need to create the bakeries map. While <code>ggmap</code> could be used to produce it in raster image format, it would require us to go through a few iterations of image rendering just to grasp the data and decide what features are worth highlighting. Something a bit more dynamic, something that allows the user to zoom, pan and click to learn more about selected locations, would be much better suited for data exploration purposes.</p>
<p>And creating that something is extremely easy thanks to the <a href="https://developers.google.com/maps/documentation/javascript/">Google Maps JavaScript <span class="caps">API</span></a>, which solves all the hard problems for us. We only really need a basic <span class="caps">HTML</span> page, a few lines of JavaScript to create map markers and data to plot.</p>
<p>We already have the last piece of the puzzle, but only in R. We need to export it to a format that can be effortlessly handled by JavaScript, and that is a long way of saying <span class="caps">JSON</span>. This is another task made easy by one of the many packages in the extensive R library.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="s">'rjson'</span><span class="p">)</span>
<span class="nf">writeLines</span><span class="p">(</span><span class="nf">toJSON</span><span class="p">(</span><span class="n">bakeries</span><span class="p">),</span><span class="w"> </span><span class="s">"./bakeries.json"</span><span class="p">)</span>
</code></pre></div>
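<p>It helps to know what shape the exported file has. As far as I can tell, <code>toJSON</code> from <code>rjson</code> treats a <code>data.frame</code> as a list of columns, so the result is a single object with one array per column rather than an array of rows. A minimal sketch on made-up data (it requires the <code>rjson</code> package):</p>

```r
library('rjson')

# Made-up two-row data frame standing in for `bakeries`
toy <- data.frame(
  name = c("Bakery A", "Bakery B"),
  lat = c(52.4, 52.5),
  lon = c(16.9, 17.0),
  stringsAsFactors = FALSE
)

json <- toJSON(toy)
cat(json, "\n")

# Reading the string back gives a list with one element per column
parsed <- fromJSON(json)
print(names(parsed))
```

<p>This is why the JavaScript side has to index each column by position, for example <code>data.lat[i]</code>, instead of iterating over row objects.</p>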
<p>Handling <span class="caps">JSON</span> in JavaScript might be easy, but actually loading it is not. For security reasons, web browsers don’t provide <span class="caps">API</span> to read local files content and it seems that the only way to fetch remote ones are asynchronous <span class="caps">HTTP</span> requests. Unfortunately, standard JavaScript library that handles these is quite low level and forces us to deal with success codes, failed requests and possible timeouts. We can take that weight off our shoulders by using third-party library, but again, that will probably mean loading quite a lot of completely unwanted code.</p>
<p>Either way, once the data is loaded, we have to loop over all items in the <span class="caps">JSON</span> array. For each row we want to create a new marker at the given coordinates and attach a function that will open a pop-up window with bakery details in reaction to a click event.</p>
<div class="highlight"><pre><span></span><code><span class="nx">$</span><span class="p">.</span><span class="nx">getJSON</span><span class="p">(</span><span class="s2">"bakeries.json"</span><span class="p">,</span><span class="w"> </span><span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">){</span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">items</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nx">data</span><span class="p">.</span><span class="nx">Wnioskodawca</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kd">var</span><span class="w"> </span><span class="nx">i</span><span class="o">=</span><span class="mf">0</span><span class="p">;</span><span class="w"> </span><span class="nx">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nx">items</span><span class="p">;</span><span class="w"> </span><span class="nx">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">marker</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ow">new</span><span class="w"> </span><span class="nx">google</span><span class="p">.</span><span class="nx">maps</span><span class="p">.</span><span class="nx">Marker</span><span class="p">({</span>
<span class="w"> </span><span class="nx">position</span><span class="o">:</span><span class="w"> </span><span class="ow">new</span><span class="w"> </span><span class="nx">google</span><span class="p">.</span><span class="nx">maps</span><span class="p">.</span><span class="nx">LatLng</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">lat</span><span class="p">[</span><span class="nx">i</span><span class="p">],</span><span class="w"> </span><span class="nx">data</span><span class="p">.</span><span class="nx">lon</span><span class="p">[</span><span class="nx">i</span><span class="p">]),</span>
<span class="w"> </span><span class="nx">title</span><span class="o">:</span><span class="w"> </span><span class="nx">data</span><span class="p">.</span><span class="nx">Wnioskodawca</span><span class="p">[</span><span class="nx">i</span><span class="p">],</span>
<span class="w"> </span><span class="nx">map</span><span class="o">:</span><span class="w"> </span><span class="nx">map</span><span class="p">,</span>
<span class="w"> </span><span class="nx">icon</span><span class="o">:</span><span class="w"> </span><span class="s1">'rogal.png'</span>
<span class="w"> </span><span class="p">});</span>
<span class="w"> </span><span class="nx">google</span><span class="p">.</span><span class="nx">maps</span><span class="p">.</span><span class="nx">event</span><span class="p">.</span><span class="nx">addListener</span><span class="p">(</span><span class="nx">marker</span><span class="p">,</span><span class="w"> </span><span class="s1">'click'</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">marker</span><span class="p">,</span><span class="w"> </span><span class="nx">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="kd">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nx">infowindow</span><span class="p">.</span><span class="nx">setContent</span><span class="p">(</span><span class="s1">''</span><span class="p">.</span><span class="nx">concat</span><span class="p">(</span>
<span class="w"> </span><span class="s1">'<div id="content"><h2>'</span><span class="p">,</span><span class="w"> </span><span class="nx">data</span><span class="p">.</span><span class="nx">Wnioskodawca</span><span class="p">[</span><span class="nx">i</span><span class="p">],</span><span class="w"> </span><span class="s1">'</h2>'</span><span class="p">,</span>
<span class="w"> </span><span class="s1">'<p><b>Address</b>: '</span><span class="p">,</span><span class="w"> </span><span class="nx">data</span><span class="p">[</span><span class="s2">"Miejsce produkcji"</span><span class="p">][</span><span class="nx">i</span><span class="p">],</span>
<span class="w"> </span><span class="s1">'</div>'</span>
<span class="w"> </span><span class="p">));</span>
<span class="w"> </span><span class="nx">infowindow</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">map</span><span class="p">,</span><span class="w"> </span><span class="nx">marker</span><span class="p">);</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">})(</span><span class="nx">marker</span><span class="p">,</span><span class="w"> </span><span class="nx">i</span><span class="p">));</span>
<span class="w"> </span><span class="nx">Gmarkers</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">marker</span><span class="p">);</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">})</span>
</code></pre></div>
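<p>The loop above reads <code>data.lat[i]</code> and <code>data.Wnioskodawca[i]</code>, which suggests that the exported file holds one array per column rather than one object per bakery. If row objects feel more natural, a small helper can reshape the data. This is only a sketch: <code>columnsToRows</code> is a hypothetical name and the sample values are invented.</p>

```javascript
// Convert a column-oriented object (one array per column) into an array of row objects.
function columnsToRows(columns) {
  const names = Object.keys(columns);
  if (names.length === 0) return [];
  // Assumes every column has the same length as the first one.
  const rowCount = columns[names[0]].length;
  const rows = [];
  for (let i = 0; i < rowCount; i++) {
    const row = {};
    for (const name of names) {
      row[name] = columns[name][i];
    }
    rows.push(row);
  }
  return rows;
}

// Hypothetical sample in the same shape as bakeries.json; the values are made up.
const sample = {
  "Wnioskodawca": ["Bakery A", "Bakery B"],
  "Miejsce produkcji": ["Street 1", "Street 2"],
  "lat": [52.40, 52.41],
  "lon": [16.92, 16.93]
};

const bakeries = columnsToRows(sample);
console.log(bakeries[1]["Miejsce produkcji"]); // prints "Street 2"
```

<p>With rows in hand, the marker loop could iterate over <code>bakeries</code> directly instead of juggling an index across several arrays.</p>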
<p>Finally, we have to create the skeleton <span class="caps">HTML</span> and JavaScript code. If I were to include them in a code snippet, I would pretty much have to paste the entire page. If you want to see that part, go ahead and look at the source of the <a href="https://mirekdlugosz.com/static/2015/bakeries-map/index.htm">map I have prepared</a>.</p>
<h2 id="closing-words"><a class="toclink" href="#closing-words">Closing words</a></h2>
<p>In this blog post we have seen how to download data from a website into R, use it to obtain coordinates of addresses and export the data into <span class="caps">JSON</span>. Finally, we have used JavaScript to create a dynamic map that can be used for data exploration. That last part was greatly inspired by <a href="http://r-video-tutorial.blogspot.com/">Fabio Veronesi</a>’s preceding <a href="http://r-video-tutorial.blogspot.com/2015/05/live-earthquake-map-with-shiny-and.html">work</a>, which has not been mentioned before.</p>
<p>Both the <a href="https://mirekdlugosz.com/static/2015/bakeries-map/transform-data.R">R code</a> and <a href="https://mirekdlugosz.com/static/2015/bakeries-map/index.htm">the final product (map)</a> are available for the curious.</p>
<h1>How to create ebook from website spanning multiple pages</h1>
<p>Mirek Długosz, 2015-09-18</p>
<p>Every now and then I stumble upon a book that disguises itself as a website - it has a table of contents and spans multiple interlinked pages, each dedicated to one coherent piece. Sometimes I like to take them offline, so I can read them while travelling. Sometimes I like to take them off the computer, because reading long passages of text on a screen is not that convenient. Sometimes I would prefer to have them in ebook format, because that’s what they really are. Here’s how I do that.</p>
<p><strong>Disclaimer #1</strong>: I am using Linux on my desktop, so instructions are for Linux computers. Both programs that I use,
<a href="https://www.httrack.com/">httrack</a> and <a href="http://calibre-ebook.com/">Calibre</a>, are cross-platform - you can run them on Windows and <span class="caps">OS</span> X too.</p>
<p><strong>Disclaimer #2</strong>: The example that I use, one of <a href="http://kbroman.org/">Karl Broman</a>’s tutorials, is released into the public domain with source code publicly available. In this particular case you could save yourself the trouble by <code>git clone</code>-ing and <code>pandoc -t epub</code>-ing the entire thing. But usually you won’t have that luxury.</p>
<p><strong>Disclaimer #3</strong>: Most content that you find online is copyrighted, even if you can read it for free (as in <em>gratis</em>). Making an offline copy probably falls under fair use. Sharing that copy with relatives or close friends probably does too, but is a slippery slope. Sharing that copy with anyone else is most likely illegal. Just don’t do it, ok?</p>
<p>For the sake of this article,
let’s say that I want to make ebook out of Karl Broman’s “<a href="http://kbroman.org/dataorg/">Organizing Data In Spreadsheets</a>” tutorial.</p>
<h2 id="download-website"><a class="toclink" href="#download-website">Download website</a></h2>
<p>First I download it to my disk:</p>
<div class="highlight"><pre><span></span><code>cd /tmp/
httrack http://kbroman.org/dataorg/ -a -v -I0
</code></pre></div>
<p><code>-v</code> is for verbose, so I know what is happening.
<code>-I0</code> (that is a capital i and a zero) tells httrack not to create its own <code>index.html</code> file, which avoids confusion later on.
<code>-a</code> is for stay on the same address, i.e. download only files whose <span class="caps">URL</span> starts with the address provided on the command line. httrack is a spider and it will happily follow any link that it finds, eventually downloading the entire Internet. You have to limit the set of pages it should be interested in, and <code>-a</code> is the easiest way to do it. If the book you are interested in is not in one directory (or, more likely, that directory also contains things you are not interested in and you can’t simply delete them after the download is completed), you will have to tweak <a href="https://www.httrack.com/html/filters.html">httrack filters</a>.</p>
<p>As a result, a new directory named after the domain (<code>kbroman.org</code>) will be created in the working directory. Since I work in a temporary location and all these files will be discarded anyway, I don’t bother with the <code>-N</code> option.</p>
<h2 id="convert-to-ebook"><a class="toclink" href="#convert-to-ebook">Convert to ebook</a></h2>
<p>Next, fire up Calibre.</p>
<p>Calibre is robust and highly customizable, so it provides one little, well-hidden checkbox that has the power to suck all the joy out of reading the final ebook. I have to decide right now whether it should be selected, because after I import the website into Calibre, it will be too late.</p>
<p>When this checkbox is selected, pages will be added in the order in which they are linked from the first page of the downloaded website. Calibre will then repeat that process for each linked page in search of missing content. If each chapter of your website is on exactly one web page, then you want to have the checkbox selected.</p>
<p>When the checkbox is unselected, which is the default, Calibre will follow each link immediately and add pages in the order it visits them. If the first chapter of your book happens to contain a link to the last chapter, the last chapter will come immediately after the first chapter in the output ebook, which is not cool. But if each chapter page consists solely of links to pages that contain the actual subchapters (like <a href="http://www-01.ibm.com/support/knowledgecenter/SSLVMB_23.0.0/statistics_mainhelp_ddita-gentopic1.dita"><span class="caps">SPSS</span> online help</a>), then you totally want the checkbox unselected.</p>
<p><span class="caps">OK</span>, maybe that was convoluted. Let me get some visual aids here. Assume that this is the structure of links on website you have downloaded:</p>
<div class="highlight"><pre><span></span><code>book
├── A.html
│ ├── C.html
│ └── D.html
├── B.html
│ └── E.html
└── C.html
</code></pre></div>
<p>If checkbox is selected, ebook content will be A.html, B.html, C.html, D.html and E.html. If checkbox is not selected, ebook content will be A.html, C.html, D.html, B.html and E.html.</p>
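<p>The two orderings can be reproduced with a few lines of JavaScript. This is an illustration of breadth-first and depth-first traversal on the structure above, not Calibre’s actual code; the <code>links</code> object and both function names are assumptions made for this example.</p>

```javascript
// Link structure from the diagram above: each page maps to the pages it links to.
const links = {
  "book": ["A.html", "B.html", "C.html"],
  "A.html": ["C.html", "D.html"],
  "B.html": ["E.html"],
  "C.html": [],
  "D.html": [],
  "E.html": []
};

// Breadth-first: add all links of the current page before descending into any of them.
function breadthFirst(graph, start) {
  const order = [];
  const seen = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const page = queue.shift();
    for (const target of graph[page] || []) {
      if (!seen.has(target)) {
        seen.add(target);
        order.push(target);
        queue.push(target);
      }
    }
  }
  return order;
}

// Depth-first: follow each link as soon as it is found.
function depthFirst(graph, start, seen = new Set([start]), order = []) {
  for (const target of graph[start] || []) {
    if (!seen.has(target)) {
      seen.add(target);
      order.push(target);
      depthFirst(graph, target, seen, order);
    }
  }
  return order;
}

console.log(breadthFirst(links, "book").join(", ")); // A.html, B.html, C.html, D.html, E.html
console.log(depthFirst(links, "book").join(", "));   // A.html, C.html, D.html, B.html, E.html
```

<p>Note that C.html is added only once in both cases: whichever page links to it first decides where it lands.</p>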
<p>To find this checkbox, go to <code>Preferences</code>, click <code>Plugins</code>, find <code>HTML to ZIP</code> (it’s in <code>File type plugins</code> section) and click <code>Customize plugin</code>. We are talking about the one labeled <code>Add linked files in breadth first order</code>, which is the only one anyway.</p>
<p><img alt="Checkbox from hell" src="https://mirekdlugosz.com/blog/2015/how-to-create-ebook-from-website-spanning-multiple-pages/ebook-from-html-tutorial/checkbox-from-hell.png"></p>
<p>In my case (I want to turn Karl Broman’s tutorial about data organization into ebook, remember?), I want this checkbox <strong>selected</strong>.</p>
<p>When I finally have this thing out of my way, I can add a new book in <span class="caps">HTML</span> format. This can be done by selecting the <code>Add books from a single directory</code> item under the <code>Add books</code> menu, or by pressing the <code>a</code> key when the main window is focused. When the file picker appears, navigate to the directory where httrack downloaded the website content and select <code>index.html</code>.</p>
<p><img alt="Calibre New Books menu" src="https://mirekdlugosz.com/blog/2015/how-to-create-ebook-from-website-spanning-multiple-pages/ebook-from-html-tutorial/add-page.png"></p>
<p>Eventually the book will appear on the list in the centre of the window, presumably with correct metadata. I fix the metadata if needed, make sure that the book is selected and click “Convert books”.</p>
<p>A new window will appear. The target format can be selected in the upper right corner. <span class="caps">EPUB</span> is the default and is good enough for me, but select <span class="caps">MOBI</span> if you have a Kindle.</p>
<p>Sometimes you might want to open the <code>Structure Definition</code> section and remove the <code>or name()='h2'</code> part from the XPath expression. This will stop Calibre from inserting a page break before every <code>h2</code> tag. This tag is used to denote subchapters, which are often very short (less than one page of a book). The correct setting of that field really depends on the structure of the particular website you are trying to convert.</p>
<p>Then, in <code>Table of Contents</code>, you might want to select <code>Manually fine-tune the ToC after conversion is completed</code>. This step will have to be repeated after each conversion run, but it lets you correct a terribly messed up table of contents.</p>
<p>Clicking <code>OK</code> will produce website in ebook format.</p>
<h2 id="tweak-output-file"><a class="toclink" href="#tweak-output-file">Tweak output file</a></h2>
<p>Ebook created with default options is readable, but rather cumbersome to navigate. Each chapter starts with web page header and ends with web page footer. These things obviously make perfect sense on live website, but are cruft in ebook and will only annoy me as I skip them. I think it’s best to remove them.</p>
<p>To do so, click <code>Convert books</code> button again and open <code>Search & Replace</code> section.</p>
<p>This section allows us to provide any number of search/replacement pairs that will be processed while the ebook is created. Search capabilities are virtually unlimited thanks to <a href="http://www.regular-expressions.info/">regular expressions</a>. After clicking the wand icon, the ebook source code will be displayed and we can see what exactly the search expression will match.</p>
<p>Since I want to select elements in the <span class="caps">HTML</span> structure and remove them, XPath would be much better. But Calibre supports only regular expressions here, so here goes.</p>
<ul>
<li>Start by pasting a string that will unambiguously match the beginning of the unwanted <span class="caps">HTML</span> part. If you have used the web browser’s developer tools to find it, make sure to double-check the string in Calibre’s preview. This is required because httrack is not only a website downloader, but also a parser, and might slightly modify the internal structure of a web page in the process. <code><div class="x" id="y"></code> is the same as <code><div id="y" class="x"></code> for a parser, but not for a regular expression.</li>
<li>Then type <code>[\s\S]*?</code>, which basically means “any character, any number of times, but as few as possible”. If you thought that “any character” is represented by a dot, you should know that a dot represents any character <strong>except</strong> a new line, and the page structure we are interested in will almost certainly span multiple lines. <code>\s</code> is any whitespace character and <code>\S</code> is any character that is not whitespace, so the character class <code>[\s\S]</code> will match any character, including a new line. The question mark means that matching should not be greedy, i.e. it should stop at the first place where the rest of the expression matches.</li>
<li>Matching the end of the unwanted string is the hardest part, because every closing tag in <span class="caps">HTML</span> looks exactly the same. The best method I have found so far is using a positive lookahead to check for, but not include in the match, the <span class="caps">HTML</span> string that immediately follows the unwanted part. When you find it, paste it and enclose it in <code>(?=</code> and <code>)</code>.</li>
</ul>
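<p>To see why each ingredient matters, here is a small experiment in JavaScript. Calibre is written in Python and uses its regular expression engine, but the constructs used here work the same way in both. The page fragment and its class names are invented for this example.</p>

```javascript
// Hypothetical page fragment; the class names are made up for illustration.
const page = [
  '<div class="menu">',
  '  <a href="/">Home</a>',
  '</div>',
  '<div class="text">',
  '  <p>Chapter text</p>',
  '</div>',
  '<div class="ads">Buy now</div>'
].join("\n");

// A dot does not cross line breaks, so this pattern finds nothing.
const withDot = /<div class="menu">.*?<\/div>/;

// [\s\S] matches line breaks too; the lookahead checks what follows without consuming it.
const lazy = /<div class="menu">[\s\S]*?<\/div>\s*(?=<div class="text">)/;

// Without the question mark the match runs up to the last closing div on the page.
const greedy = /<div class="menu">[\s\S]*<\/div>/;

console.log(withDot.test(page));        // false
console.log(page.replace(lazy, ""));    // output starts with <div class="text">
console.log(page.replace(greedy, ""));  // prints an empty line: everything was removed
```

<p>The greedy variant is the one to watch out for: it matches happily, but it takes the whole chapter with it.</p>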
<p>In my example, header can be matched by this regexp:</p>
<div class="highlight"><pre><span></span><code><span class="nt"><div</span><span class="w"> </span><span class="na">class=</span><span class="s">"navbar"</span><span class="nt">></span>[\s\S]*?<span class="nt"></div></span>\s*(?=<span class="nt"><div</span><span class="w"> </span><span class="na">class=</span><span class="s">"container-narrow"</span><span class="nt">></span>)
</code></pre></div>
<p>And footer (including superfluous link to next section) by this one:</p>
<div class="highlight"><pre><span></span><code><span class="nt"><hr/></span>\s*<span class="nt"><p></span>Next<span class="w"> </span>up[\s\S]*?<span class="nt"></footer></span>
</code></pre></div>
<p>Replacement text is empty, because I want to remove these parts.</p>
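<p>As a sanity check, both expressions can be tried outside Calibre. The sketch below runs them in JavaScript against an invented page that only mimics the structure I assume the tutorial has; the actual markup may differ.</p>

```javascript
// The two expressions from this article, written as JavaScript regular expressions.
// String.raw keeps the backslashes intact when the pattern is given as a string.
const header = new RegExp(String.raw`<div class="navbar">[\s\S]*?</div>\s*(?=<div class="container-narrow">)`);
const footer = new RegExp(String.raw`<hr/>\s*<p>Next up[\s\S]*?</footer>`);

// Hypothetical page; titles, links and footer text are placeholders.
const page = [
  '<body>',
  '<div class="navbar">',
  '<a href="index.html">Home</a>',
  '</div>',
  '<div class="container-narrow">',
  '<h2>Chapter title</h2>',
  '<p>Chapter text.</p>',
  '<hr/>',
  '<p>Next up: <a href="next.html">Next chapter</a></p>',
  '<footer>Some footer text</footer>',
  '</div>',
  '</body>'
].join("\n");

// An empty replacement removes the matched part.
const cleaned = page.replace(header, "").replace(footer, "");
console.log(cleaned);
```

<p>The printed page keeps the chapter title and text, while the navigation bar, the “Next up” link and the footer are gone.</p>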
<p><img alt="Search and Replace section with expressions filled in" src="https://mirekdlugosz.com/blog/2015/how-to-create-ebook-from-website-spanning-multiple-pages/ebook-from-html-tutorial/search-and-replace.png"></p>
<p>I click <code>OK</code> to prepare book in ebook format. This time I obtain something that actually can be read in sequence without interruptions.</p>
<h2 id="finishing-touches"><a class="toclink" href="#finishing-touches">Finishing touches</a></h2>
<p>If you are still not happy about the output file, play around with the multitude of Calibre’s options.
Its enormous <a href="http://manual.calibre-ebook.com/">user manual</a> might be helpful.</p>