# WebM Parser {#mainpage}

# Introduction

This WebM parser is a C++11-based parser that aims to be a safe and complete
parser for WebM. It supports all WebM elements (from the old deprecated ones to
the newest ones like `Colour`), including recursive elements like `ChapterAtom`
and `SimpleTag`. It supports incremental parsing; parsing may be stopped at any
point and resumed later as needed. It also supports starting at an arbitrary
WebM element, so parsing need not start from the beginning of the file.

The parser (`WebmParser`) works by being fed input data from a data source (an
instance of `Reader`) that represents a WebM file. The parser will parse the
WebM data into various data structures that represent the encoded WebM elements,
and then call corresponding `Callback` event methods as the data structures are
parsed.

# Building

CMake support has been added to the root libwebm `CMakeLists.txt` file. Simply
enable the `ENABLE_WEBM_PARSER` feature if using the interactive CMake builder,
or alternatively pass the `-DENABLE_WEBM_PARSER:BOOL=ON` flag from the command
line. By default, this parser is not enabled when building libwebm, so you must
explicitly enable it.

Alternatively, the following illustrates the minimal commands necessary to
compile the code into a static library without CMake:

```.sh
c++ -Iinclude -I. -std=c++11 -c src/*.cc
ar rcs libwebm.a *.o
```

# Using the parser

The parser has three basic components: `Reader`, `Callback`, and `WebmParser`.

## `Reader`

The `Reader` interface acts as a data source for the parser. You may subclass it
and implement your own data source if you wish. Alternatively, use the
`FileReader`, `IstreamReader`, or `BufferReader` if you wish to read from a
`FILE*`, `std::istream`, or `std::vector<std::uint8_t>`, respectively.

The parser supports `Reader` implementations that do short reads. If
`Reader::Skip()` or `Reader::Read()` performs a partial read (returning
`Status::kOkPartial`), the parser will call it again in an attempt to read
more data. If no data is available, the `Reader` may return some other status
(like `Status::kWouldBlock`) to indicate that no data is available. In this
situation, the parser will stop parsing and return the status it received.
Parsing may be resumed later when more data is available.

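
The short-read behavior described above can be sketched without the library.
This is a minimal illustration under stated assumptions: `DemoStatus`,
`ShortReadSource`, and `ReadFully` are hypothetical stand-ins invented for this
example (they are not libwebm types), and the retry loop only imitates the
documented behavior of calling the reader again after a partial read and
stopping on any other status.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for the status codes named above; this is not the library's type.
enum class DemoStatus { kOkCompleted, kOkPartial, kWouldBlock };

// Hypothetical streaming source: bytes arrive via Append(), and each Read()
// hands out at most `max_chunk` bytes to imitate a short read.
class ShortReadSource {
 public:
  explicit ShortReadSource(std::size_t max_chunk) : max_chunk_(max_chunk) {
    assert(max_chunk_ > 0);
  }

  void Append(const std::vector<std::uint8_t>& bytes) {
    data_.insert(data_.end(), bytes.begin(), bytes.end());
  }

  DemoStatus Read(std::size_t num_to_read, std::uint8_t* buffer,
                  std::uint64_t* num_actually_read) {
    assert(buffer != nullptr && num_actually_read != nullptr);
    *num_actually_read = 0;
    const std::size_t available = data_.size() - position_;
    if (available == 0) {
      return DemoStatus::kWouldBlock;  // Nothing buffered yet; retry later.
    }
    const std::size_t count = std::min({num_to_read, available, max_chunk_});
    std::memcpy(buffer, data_.data() + position_, count);
    position_ += count;
    *num_actually_read = count;
    return count == num_to_read ? DemoStatus::kOkCompleted
                                : DemoStatus::kOkPartial;
  }

 private:
  std::vector<std::uint8_t> data_;
  std::size_t max_chunk_;
  std::size_t position_ = 0;
};

// Keeps calling Read() while it reports a partial read and returns any other
// status to the caller. `total_read` persists across calls so that a later
// call can resume where the previous one stopped.
DemoStatus ReadFully(ShortReadSource* source, std::size_t num_to_read,
                     std::uint8_t* buffer, std::size_t* total_read) {
  assert(source != nullptr && total_read != nullptr);
  DemoStatus status = DemoStatus::kOkPartial;
  while (status == DemoStatus::kOkPartial) {
    std::uint64_t n = 0;
    status = source->Read(num_to_read - *total_read, buffer + *total_read, &n);
    *total_read += static_cast<std::size_t>(n);
  }
  return status;
}
```

In this sketch, a caller that receives `DemoStatus::kWouldBlock` keeps its
buffer and `total_read` counter, waits for more data to be appended, and calls
`ReadFully()` again to pick up where it left off.
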
When the `Reader` has reached the end of the WebM document and no more data is
available, it should return `Status::kEndOfFile`. This will cause parsing to
stop. If the file ends at a valid location (that is, there aren't any elements
that have specified a size that indicates the file ended prematurely), the
parser will translate `Status::kEndOfFile` into `Status::kOkCompleted` and
return it. If the file ends prematurely, the parser will return
`Status::kEndOfFile` to indicate that.

Note that if the WebM file contains elements that have an unknown size (or a
seek has been performed and the parser doesn't know the size of the root
element(s)), and the parser is parsing them and hits end-of-file, the parser may
still call `Reader::Read()`/`Reader::Skip()` multiple times (even though they've
already reported `Status::kEndOfFile`) as nested parsers terminate parsing.
Because of this, `Reader::Read()`/`Reader::Skip()` implementations should be
able to handle being called multiple times after the file's end has been
reached, and they should consistently return `Status::kEndOfFile`.

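
One way to satisfy the end-of-file requirement above is to derive the result
from the current position on every call instead of from a one-shot flag. The
sketch below assumes hypothetical stand-in names (`DemoStatus`, `FiniteSource`)
that are not part of the library; it only shows a reader-like class whose
`Read()` and `Skip()` keep reporting end-of-file no matter how often they are
called after the data runs out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Stand-in for the status codes named above; this is not the library's type.
enum class DemoStatus { kOkCompleted, kOkPartial, kEndOfFile };

// Hypothetical in-memory source over a fixed byte buffer.
class FiniteSource {
 public:
  explicit FiniteSource(std::vector<std::uint8_t> data)
      : data_(std::move(data)) {}

  DemoStatus Read(std::size_t num_to_read, std::uint8_t* buffer,
                  std::uint64_t* num_actually_read) {
    assert(buffer != nullptr && num_actually_read != nullptr);
    *num_actually_read = 0;
    if (position_ == data_.size()) {
      return DemoStatus::kEndOfFile;  // Same answer on every later call.
    }
    const std::size_t count = std::min(num_to_read, data_.size() - position_);
    std::memcpy(buffer, data_.data() + position_, count);
    position_ += count;
    *num_actually_read = count;
    return count == num_to_read ? DemoStatus::kOkCompleted
                                : DemoStatus::kOkPartial;
  }

  DemoStatus Skip(std::uint64_t num_to_skip,
                  std::uint64_t* num_actually_skipped) {
    assert(num_actually_skipped != nullptr);
    *num_actually_skipped = 0;
    if (position_ == data_.size()) {
      return DemoStatus::kEndOfFile;  // Same answer on every later call.
    }
    const std::uint64_t remaining = data_.size() - position_;
    const std::uint64_t count = std::min(num_to_skip, remaining);
    position_ += static_cast<std::size_t>(count);
    *num_actually_skipped = count;
    return count == num_to_skip ? DemoStatus::kOkCompleted
                                : DemoStatus::kOkPartial;
  }

 private:
  std::vector<std::uint8_t> data_;
  std::size_t position_ = 0;
};
```
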
The three provided readers (`FileReader`, `IstreamReader`, and `BufferReader`)
are blocking implementations (they won't return `Status::kWouldBlock`), so if
you're using them the parser will run until it has consumed all of their data
(unless you ask the parser to stop via `Callback`; see the next section).

## `Callback`

As the parser progresses through the file, it builds objects (see
`webm/dom_types.h`) that represent parsed data structures. The parser then
notifies the `Callback` implementation as objects complete parsing. For some
data structures (like frames or Void elements), the parser notifies the
`Callback` and requests it to consume the data directly from the `Reader`. This
is done for structures that can be large or frequent binary blobs, so that you
can read the data directly into the object or type of your choice rather than
having it read into a `std::vector<std::uint8_t>` that you would then have to
copy into a different object.

The parser was designed to parse the data into objects that are small enough
that the `Callback` can be quickly and frequently notified as soon as the object
is ready, but large enough that the objects received by the `Callback` are still
useful. Having `Callback` events for every tiny integer/float/string/etc.
element would require too much assembly and work to be useful to most users, and
parsing the file into a single DOM tree (or a small handful of large conglomerate
structures) would unnecessarily delay video playback or consume too much memory
on smaller devices.

The parser may call the following methods nearly anywhere in the file:

- `Callback::OnElementBegin()`: This is called for every element that the
  parser encounters. This is primarily useful if you want to skip some
  elements or build a map of every element in the file.
- `Callback::OnUnknownElement()`: This is called when an element is either not
  a valid/recognized WebM element, or it is a WebM element but is improperly
  nested (e.g. an EBMLVersion element inside of a Segment element). The parser
  doesn't know how to handle the element; it could just skip it but instead
  defers to the `Callback` to decide how it should be handled. The default
  implementation just skips the element.
- `Callback::OnVoid()`: Void elements can appear anywhere in any master
  element. This method will be called to handle the Void element.

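
The two uses of `Callback::OnElementBegin()` mentioned in the list (skipping
elements and building a map of the file) can be illustrated with a small
stand-alone sketch. Everything here is a hypothetical stand-in rather than the
library's API: `DemoStatus`, `DemoAction`, `DemoElementMetadata`, and
`ElementIndexer` exist only for this example, and the `kSkip` value is an
assumption since this document only names `Action::kRead`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Stand-ins for the library's types; names and members are assumptions.
enum class DemoStatus { kOkCompleted };
enum class DemoAction { kRead, kSkip };

struct DemoElementMetadata {
  std::uint32_t id;        // Element ID as it appears in the file.
  std::uint64_t position;  // Byte offset of the element header.
};

// Counts how often each element ID is seen and asks the caller to skip
// every element whose ID equals `id_to_skip`.
class ElementIndexer {
 public:
  explicit ElementIndexer(std::uint32_t id_to_skip)
      : id_to_skip_(id_to_skip) {}

  DemoStatus OnElementBegin(const DemoElementMetadata& metadata,
                            DemoAction* action) {
    assert(action != nullptr);
    ++counts_[metadata.id];
    *action = (metadata.id == id_to_skip_) ? DemoAction::kSkip
                                           : DemoAction::kRead;
    return DemoStatus::kOkCompleted;
  }

  // Number of elements seen so far with the given ID.
  std::size_t CountOf(std::uint32_t id) const {
    const auto it = counts_.find(id);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::uint32_t id_to_skip_;
  std::map<std::uint32_t, std::size_t> counts_;
};
```

A real implementation would put the same logic in an override of
`Callback::OnElementBegin()` and use the library's own metadata and action
types.
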
The parser may call the following methods in the proper nesting order, as shown
in the list. A `*Begin()` method will always be matched up with its
corresponding `*End()` method (unless a seek has been performed). The parser
will only call the methods in the proper nesting order as specified in the WebM
DOM. For example, `Callback::OnEbml()` will never be called in between
`Callback::OnSegmentBegin()`/`Callback::OnSegmentEnd()` (since the EBML element
is not a child of the Segment element), and `Callback::OnTrackEntry()` will only
ever be called in between
`Callback::OnSegmentBegin()`/`Callback::OnSegmentEnd()` (since the TrackEntry
element is a (grand-)child of the Segment element and must be contained by a
Segment element). `Callback::OnFrame()` is listed twice because it will be
called to handle frames contained in both SimpleBlock and Block elements.

- `Callback::OnEbml()`
- `Callback::OnSegmentBegin()`
  - `Callback::OnSeek()`
  - `Callback::OnInfo()`
  - `Callback::OnClusterBegin()`
    - `Callback::OnSimpleBlockBegin()`
      - `Callback::OnFrame()`
    - `Callback::OnSimpleBlockEnd()`
    - `Callback::OnBlockGroupBegin()`
      - `Callback::OnBlockBegin()`
        - `Callback::OnFrame()`
      - `Callback::OnBlockEnd()`
    - `Callback::OnBlockGroupEnd()`
  - `Callback::OnClusterEnd()`
  - `Callback::OnTrackEntry()`
  - `Callback::OnCuePoint()`
  - `Callback::OnEditionEntry()`
  - `Callback::OnTag()`
- `Callback::OnSegmentEnd()`

Only `Callback::OnFrame()` (and no other `Callback` methods) will be called in
between `Callback::OnSimpleBlockBegin()`/`Callback::OnSimpleBlockEnd()` or
`Callback::OnBlockBegin()`/`Callback::OnBlockEnd()`, since the SimpleBlock and
Block elements are not master elements and only contain frames.

Note that seeking into the middle of the file may cause the parser to skip some
`*Begin()` methods. For example, if a seek is performed to a SimpleBlock
element, `Callback::OnSegmentBegin()` and `Callback::OnClusterBegin()` will not
be called. In this situation, the full sequence of callback events would be
(assuming the file ended after the SimpleBlock):
`Callback::OnSimpleBlockBegin()`, `Callback::OnFrame()` (for every frame in the
SimpleBlock), `Callback::OnSimpleBlockEnd()`, `Callback::OnClusterEnd()`, and
`Callback::OnSegmentEnd()`. Since the Cluster and Segment elements were skipped,
the `Cluster` DOM object may have some members marked as absent, and the
`*End()` events for the Cluster and Segment elements will have metadata with
unknown header position, header length, and body size (see `kUnknownHeaderSize`,
`kUnknownElementSize`, and `kUnknownElementPosition`).

When a `Callback` method has completed, it should return `Status::kOkCompleted`
to allow parsing to continue. If you would like parsing to stop, return any
other status code (except `Status::kEndOfFile`, since that's treated somewhat
specially and is intended for `Reader`s to use), which the parser will return.
If you return a non-parsing-error status code (e.g. `Status::kOkPartial`,
`Status::kWouldBlock`, etc., or your own status code with a value > 0), parsing
may be resumed again. When parsing is resumed, the parser will call the same
callback method again (and once again, you may return `Status::kOkCompleted` to
let parsing continue or some other value to stop parsing).

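
The stop-and-resume contract described above can be modeled in a few lines.
This is only a sketch under stated assumptions: `DemoStatus`, `BoundedCallback`,
and `DemoDriver` are hypothetical stand-ins, and `DemoDriver::Feed()` merely
imitates the documented behavior (return the callback's status to the caller,
then deliver the same event again on the next call). It does not reflect how
the parser is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for the status codes named above; this is not the library's type.
enum class DemoStatus { kOkCompleted, kWouldBlock };

// Hypothetical callback that can hold only `capacity` events until Drain().
class BoundedCallback {
 public:
  explicit BoundedCallback(std::size_t capacity) : capacity_(capacity) {}

  DemoStatus OnEvent(int event) {
    if (pending_.size() >= capacity_) {
      return DemoStatus::kWouldBlock;  // Ask the driver to stop for now.
    }
    pending_.push_back(event);
    return DemoStatus::kOkCompleted;  // Let the driver continue.
  }

  // Hands the buffered events to the caller and frees up capacity.
  std::vector<int> Drain() {
    std::vector<int> drained;
    drained.swap(pending_);
    return drained;
  }

 private:
  std::size_t capacity_;
  std::vector<int> pending_;
};

// Hypothetical driver that models resumption: an event is marked as
// delivered only after the callback accepts it, so a refused event is
// offered to the same callback method again on the next Feed() call.
class DemoDriver {
 public:
  explicit DemoDriver(std::vector<int> events) : events_(std::move(events)) {}

  DemoStatus Feed(BoundedCallback* callback) {
    assert(callback != nullptr);
    while (next_ < events_.size()) {
      const DemoStatus status = callback->OnEvent(events_[next_]);
      if (status != DemoStatus::kOkCompleted) {
        return status;  // `next_` is unchanged, so this event is retried.
      }
      ++next_;
    }
    return DemoStatus::kOkCompleted;
  }

 private:
  std::vector<int> events_;
  std::size_t next_ = 0;
};
```
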
You may subclass `Callback` and override the methods for which you are
interested in receiving events. By default, methods taking an `Action`
parameter will set it to `Action::kRead` so the entire file is parsed. The
`Callback::OnFrame()` method will just skip over the frame bytes by default.

## `WebmParser`

The actual parsing work is done with `WebmParser`. Simply construct a
`WebmParser` and call `WebmParser::Feed()` (providing it a `Callback` and
`Reader` instance) to parse a file. It will return `Status::kOkCompleted` when
the entire file has been successfully parsed. `WebmParser::Feed()` doesn't store
any internal references to the `Callback` or `Reader`.

If you wish to start parsing from the middle of a file, call
`WebmParser::DidSeek()` before calling `WebmParser::Feed()` to prepare the
parser to receive data starting at an arbitrary point in the file. When seeking,
you should seek to the beginning of a WebM element; seeking to a location that
is not the start of a WebM element (e.g. seeking to a frame, rather than its
containing SimpleBlock/Block element) will cause parsing to fail. Calling
`WebmParser::DidSeek()` will reset the state of the parser and clear any
internal errors, so a `WebmParser` instance may be reused (even if it has
previously failed to parse a file).

## Building your program

The following is a small program that completely parses a file from stdin:

```.cc
#include <cstdio>

#include <webm/callback.h>
#include <webm/file_reader.h>
#include <webm/webm_parser.h>

int main() {
  // The default Callback ignores every event, so nothing is printed.
  video_webm_parser::Callback callback;
  // Reopen stdin in binary mode so the bytes are read unmodified.
  video_webm_parser::FileReader reader(std::freopen(nullptr, "rb", stdin));
  video_webm_parser::WebmParser parser;
  parser.Feed(&callback, &reader);
}
```

It completely parses the input file, but we need to make a new class that
derives from `Callback` if we want to receive any parsing events. So if we
change it to:

```.cc
#include <cstdint>
#include <cstdio>
#include <iomanip>
#include <iostream>

#include <webm/callback.h>
#include <webm/file_reader.h>
#include <webm/status.h>
#include <webm/webm_parser.h>

class MyCallback : public video_webm_parser::Callback {
 public:
  video_webm_parser::Status OnElementBegin(
      const video_webm_parser::ElementMetadata& metadata,
      video_webm_parser::Action* action) override {
    std::cout << "Element ID = 0x"
              << std::hex << static_cast<std::uint32_t>(metadata.id);
    std::cout << std::dec;  // Reset to decimal mode.
    std::cout << " at position ";
    if (metadata.position == video_webm_parser::kUnknownElementPosition) {
      // The position will only be unknown if we've done a seek. But since we
      // aren't seeking in this demo, this will never be the case. However, this
      // if-statement is included for completeness.
      std::cout << "<unknown>";
    } else {
      std::cout << metadata.position;
    }
    std::cout << " with header size ";
    if (metadata.header_size == video_webm_parser::kUnknownHeaderSize) {
      // The header size will only be unknown if we've done a seek. But since we
      // aren't seeking in this demo, this will never be the case. However, this
      // if-statement is included for completeness.
      std::cout << "<unknown>";
    } else {
      std::cout << metadata.header_size;
    }
    std::cout << " and body size ";
    if (metadata.size == video_webm_parser::kUnknownElementSize) {
      // WebM master elements may have an unknown size, though this is rare.
      std::cout << "<unknown>";
    } else {
      std::cout << metadata.size;
    }
    std::cout << '\n';

    *action = video_webm_parser::Action::kRead;
    return video_webm_parser::Status(video_webm_parser::Status::kOkCompleted);
  }
};

int main() {
  MyCallback callback;
  video_webm_parser::FileReader reader(std::freopen(nullptr, "rb", stdin));
  video_webm_parser::WebmParser parser;
  video_webm_parser::Status status = parser.Feed(&callback, &reader);
  if (status.completed_ok()) {
    std::cout << "Parsing successfully completed\n";
  } else {
    std::cout << "Parsing failed with status code: " << status.code << '\n';
  }
}
```

This will output information about every element in the entire file: its ID,
position, header size, and body size. The status of the parse is also checked
and reported.

For a more complete example, see `demo/demo.cc`, which parses an entire file and
prints out all of its information. That example overrides every `Callback`
method to show exactly what information is available while parsing and how to
access it. The example is verbose, but that's primarily due to pretty-printing
and string formatting operations.

When compiling your program, add the `include` directory to your compiler's
header search paths and link to the compiled library. Be sure your compiler has
C++11 mode enabled (`-std=c++11` in clang++ or g++).

# Testing

Unit tests are located in the `tests` directory. Google Test and Google Mock are
used as testing frameworks. Building and running the tests will be supported in
the upcoming CMake scripts, but they can currently be built and run by manually
compiling them (and linking to Google Test and Google Mock).

# Fuzzing

The parser has been fuzzed with [AFL](http://lcamtuf.coredump.cx/afl/) and
[libFuzzer](http://llvm.org/docs/LibFuzzer.html). If you wish to fuzz the parser
with AFL or libFuzzer but don't want to write an executable that exercises the
parsing API, you may use `fuzzing/webm_fuzzer.cc`.

When compiling for fuzzing, define the macro
`WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT` to be some integer in order to limit the
maximum size of ASCII/UTF-8/binary elements. It's too easy for the fuzzer to
generate elements that claim to have a ridiculously massive size, which will
cause allocations to fail or the program to allocate too much memory. AFL will
terminate the process if it allocates too much memory (by default, 50 MB), and
the [Address Sanitizer doesn't throw `std::bad_alloc` when an allocation
fails](https://github.com/google/sanitizers/issues/295). Defining
`WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT` to a low number (say, 1024) will cause the
ASCII/UTF-8/binary element parsers to return `Status::kNotEnoughMemory` if the
element's size exceeds `WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT`, which will avoid
false positives when fuzzing. The parser expects `std::string` and `std::vector`
to throw `std::bad_alloc` when an allocation fails, which doesn't necessarily
happen due to the fuzzers' limitations.

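
The idea behind the size limit can be sketched as a guard in front of an
allocation. The code below is an illustration under stated assumptions, not the
parser's implementation: `DEMO_BYTE_ELEMENT_SIZE_LIMIT`, `DemoStatus`, and
`AllocateBody` are hypothetical names chosen for this example, and the sketch
only shows a declared size being rejected before any memory is requested, with
allocation failures mapped to the same status.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical compile-time cap used only by this sketch. It plays the role
// that WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT is described as playing above.
#ifndef DEMO_BYTE_ELEMENT_SIZE_LIMIT
#define DEMO_BYTE_ELEMENT_SIZE_LIMIT 1024
#endif

// Stand-in for the status codes named above; this is not the library's type.
enum class DemoStatus { kOkCompleted, kNotEnoughMemory };

// Resizes `body` to `declared_size` bytes, refusing sizes above the cap so
// that a bogus size read from a fuzzed input never reaches the allocator.
DemoStatus AllocateBody(std::uint64_t declared_size,
                        std::vector<std::uint8_t>* body) {
  assert(body != nullptr);
  if (declared_size >
      static_cast<std::uint64_t>(DEMO_BYTE_ELEMENT_SIZE_LIMIT)) {
    return DemoStatus::kNotEnoughMemory;
  }
  try {
    body->resize(static_cast<std::size_t>(declared_size));
  } catch (const std::bad_alloc&) {
    return DemoStatus::kNotEnoughMemory;  // Allocation failed.
  } catch (const std::length_error&) {
    return DemoStatus::kNotEnoughMemory;  // Size exceeds vector::max_size().
  }
  return DemoStatus::kOkCompleted;
}
```

With the default cap of 1024 used in this sketch, a declared size of 16 is
accepted and a declared size of one terabyte is rejected without touching the
vector.
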
You may also define the macro `WEBM_FUZZER_SEEK_FIRST` to have
`fuzzing/webm_fuzzer.cc` call `WebmParser::DidSeek()` before doing any parsing.
This will test the seeking code paths.