Parsing HTML in Dart with Unit tests
Today I wanted to build a simple web crawler in Dart with unit tests. The road turned out to be bumpier than I had expected. I'll write down what I learned.
Making a get request
A Google search leads to this Dart cookbook, and shows that it's as simple as adding the http package, then:
import 'package:http/http.dart' as http;
...
final response = await http.get(Uri.parse('https://...'));
And indeed, it worked fine.
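For a crawler it is probably worth checking the status code before using the body. Here is a minimal sketch of that idea. The helper name bodyOrThrow and the example.com URL are placeholders of mine, not part of the cookbook:

```dart
import 'package:http/http.dart' as http;

/// Returns the body of [response], or throws if the status is not 200 OK.
/// (Hypothetical helper, named here for illustration only.)
String bodyOrThrow(http.Response response) {
  if (response.statusCode != 200) {
    throw Exception('Request failed with status ${response.statusCode}');
  }
  return response.body;
}

Future<void> main() async {
  // Placeholder URL; replace it with the page you want to crawl.
  final response = await http.get(Uri.parse('https://example.com/'));
  print('Downloaded ${bodyOrThrow(response).length} characters');
}
```

Keeping the check in a small function that takes a Response means it can be exercised without touching the network.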
Parsing the HTML
Now it gets slightly trickier. It's actually really simple, but for some reason, I started in the wrong direction: https://api.dart.dev/stable/2.15.1/dart-html/DomParser/parseFromString.html. The doc is surprisingly sparse. They don't even document what type should be. On MDN, I found some more documentation (see here), so I tried:
import 'dart:html';
...
DomParser().parseFromString(response.body, 'text/html');
But it failed with this error:
: Error: Not found: 'dart:html'
lib/my_crawler.dart:1
import 'dart:html';
^
: Error: Method not found: 'DomParser'.
lib/my_crawler.dart:20
DomParser().parseFromString(doc.body, 'text/html');
^^^^^^^^^
The reason is that dart:html is only available in the browser, i.e. only when you're building Dart web apps.
To parse HTML on other platforms, one should use the html package.
import 'package:html/parser.dart' show parse;
import 'package:http/http.dart' as http;
...
final response = await http.get(uri);
final document = parse(response.body);
final links = document.querySelectorAll('a');
final uris = links.map((link) => Uri.parse(link.attributes['href'] as String));
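One caveat with the snippet above: the `as String` cast throws for any anchor that has no href attribute, and relative links are not resolved against the page URL. A slightly more defensive version could look like the sketch below. The function name extractLinks and the sample HTML are made up for illustration:

```dart
import 'package:html/parser.dart' show parse;

/// Returns the absolute URIs of all `<a href="...">` elements in [html].
/// Relative links are resolved against [base]; anchors without an href
/// and hrefs that cannot be parsed as a URI are skipped.
List<Uri> extractLinks(String html, Uri base) {
  final document = parse(html);
  final uris = <Uri>[];
  for (final link in document.querySelectorAll('a')) {
    final href = link.attributes['href']; // String?, null when missing
    if (href == null) continue;
    final parsed = Uri.tryParse(href);
    if (parsed == null) continue;
    uris.add(base.resolveUri(parsed));
  }
  return uris;
}

void main() {
  // Made-up sample: one relative link, one anchor without href, one absolute link.
  const html = '<a href="/gpu.php?id=1">A</a><a name="top">B</a>'
      '<a href="https://example.com/x">C</a>';
  final links = extractLinks(html, Uri.parse('https://example.com/list.html'));
  links.forEach(print);
}
```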
Perfect! Now let's write unit tests so we can start iterating faster.
Unit tests
Preparing the test data
Unit tests will help me iterate faster, because I won't have to download the same HTML page every time I run the crawler. Instead, I'll download the page once, and the test will read that file locally. Here's how to download the page:
% curl https://www.videocardbenchmark.net/common_gpus.html > common_gpus.html
Note: when the page isn't found, the > filename trick did not work for me. It simply created an empty file instead of the 404 error page. To capture the 404 error page, I had to use the --output filename parameter. Finally, for URLs that are slightly more complicated (something?a=123&b=456), you have to pass the URL within quotes, otherwise the shell interprets characters such as & and ? itself:
% curl --output GeForce_RTX_3080_Ti.html "https://www.videocardbenchmark.net/gpu.php?gpu=GeForce+RTX+3080+Ti&id=4409"
Mocking http
Change the function signature to pass mocks
Next we need to use a mock http Client. Instead of calling the top-level http.get function, we should pass a specific client to the crawler. So I changed the crawl function's signature from:
static Future<Map<String, Benchmark>> crawlPage(Uri uri)
to:
static Future<Map<String, Benchmark>> crawlPage(Uri uri, Client client)
In my application, I have to pass an actual client:
MyCrawler.crawlPage(Uri.parse('https://...'), Client());
In my test, I can then pass a mock:
import 'dart:io';
import 'package:gpu_benchmarks/videocardbenchmark_crawler.dart';
import 'package:http/http.dart';
import 'package:mockito/mockito.dart';
import 'package:test/test.dart';
void main() {
  test('Crawls a page', () async {
    const url = 'https://www.videocardbenchmark.net/common_gpus.html';
    final client = MockClient();
    final uri = Uri.parse(url);
    ...
    final r = await VideoCardBenchmarkCrawler.crawlPage(uri, client);
    expect(r.length, equals(100));
  });
}

class MockClient extends Mock implements Client {}
Mocking an http response
Finally, I want to make my MockClient return the file I previously downloaded. According to the Mockito doc, it's as simple as:
final mockResponse = Response(File('./test/common_gpus.html').readAsStringSync(), 200);
when(client.get(any)).thenReturn(Future.value(mockResponse));
But this fails to compile. It turns out this is a new type of error that appeared with the introduction of null safety in Dart. Since any returns null, and Client.get expects a non-null Uri, the compiler rejects the call.
I tried to get around the error by not using any, and matching with the exact value instead:
when(client.get(uri)).thenReturn(...);
It compiles but at runtime, it throws:
type 'Null' is not a subtype of type 'Future<Response>' at MockClient.get
It seems like the mock client.get returns null, which the runtime won't accept since it expects a Future<Response>. So at this point it's better to follow the doc than to try the old ways at random.
Mockito now has a full doc about how to handle null safety (see here). Interestingly, it also uses an example about mocking a client response.
The doc explains that the new way to mock my Client is to add an annotation that tells Mockito to generate the mock using the build_runner package.
import 'dart:io';
import 'package:http/http.dart';
import 'package:mockito/annotations.dart';
import 'package:mockito/mockito.dart';
import 'package:test/test.dart';
@GenerateMocks([Client])
void main() {
  test('Crawls a page', () async {
    ...
The doc shows the command to run build_runner once everything is set up, but it shows the outdated command pub run build_runner build. The new one is:
% dart run build_runner build
Running build_runner generates the MockClient class in crawler_test.mocks.dart, and I can now import that class in my test. The generated get method actually accepts a nullable Uri? instead, which means you can use any without a problem.
Finally, I had one last runtime error:
Invalid argument(s): `thenReturn` should not be used to return a Future. Instead, use `thenAnswer((_) => future)`.
That's because for futures, we're supposed to use thenAnswer. Our code then becomes:
final mockResponse = Response(File('./test/common_gpus.html').readAsStringSync(), 200);
when(client.get(any)).thenAnswer((_) => Future.value(mockResponse));
And now the test finally runs successfully.
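As an aside, and as a different technique from the Mockito setup above: package:http ships its own MockClient in package:http/testing.dart. It takes a handler function and needs no code generation, which may be enough for simple cases like this one. Below is a minimal sketch; pageLength is a hypothetical stand-in for the crawler, and the HTML string stands in for the downloaded file:

```dart
import 'package:http/http.dart';
import 'package:http/testing.dart';

/// Hypothetical stand-in for the crawler: returns the number of characters
/// in the page fetched through the injected [client].
Future<int> pageLength(Uri uri, Client client) async {
  final response = await client.get(uri);
  return response.body.length;
}

Future<void> main() async {
  // The handler receives every request and returns a canned response,
  // so no network access happens.
  final client = MockClient((request) async {
    return Response('<html><body>stub</body></html>', 200);
  });
  final length = await pageLength(Uri.parse('https://example.com/'), client);
  print('Fetched $length characters from the mock');
}
```

Note that this class has the same name as the one Mockito generates, so importing both into the same test file would require a prefix or a hide clause on one of the imports.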