Commit Graph

1266 Commits

Author SHA1 Message Date
Dmitry Kovalev
1f6e95e76a Merge "Removing redundant struct from union b_mode_info." 2013-07-02 18:09:31 -07:00
Dmitry Kovalev
be77f6bbbf Removing redundant struct from union b_mode_info.
Change-Id: I08fc6e474ff2c12cfa065bae4989c724276e2c83
2013-07-02 16:51:57 -07:00
Dmitry Kovalev
edb060a77c Adding write_selected_txfm_size function.
Change-Id: I143b430b7c24a964ccd0ebb75944cf317a072214
2013-07-02 16:41:22 -07:00
Yaowu Xu
0d7b7c09cb Added a speed feature use_square_partition_only
This commit adds a speed feature where only squared partition are
evaluated in partition picking. Enable this feature in cpu-used 2
reduces encoding time by ~30%.

loss of compression:
-0.9% on cif set
-1.23% on stdhd

Change-Id: Ia6fad11210f0b78365abb889f9245604513be5b9
2013-07-02 16:40:15 -07:00
Ronald S. Bultje
e5fb4b61b6 Use pmovmskb to skip quantize loops over empty coefficients.
If none of the 16 coefficients that we quantize per loop iteration
are larger than the zbin, directly skip to the next round of coeffs,
rather than doing a full quantize loop that will eventually result
in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
no significantly positive results for smaller transforms, so is not
used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).

Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
2013-07-02 16:34:24 -07:00
Deb Mukherjee
37501d687c Speed feature to binary search dir intramodes
This speed feature will skip searching the directional intra prediction
modes D63, D117, D27, D153 if the best intra mode so far is not one of
the diagonal, horizontal or vertical directions closest to the respective
directions being tested. In other words, this implements a sort of
binary search in the angular domain.

Speedup: about 9-10%
Results: -0.05% only on derfraw300.

Change-Id: I413584c41f2a3e8dabfbdeb40718c8fc4b1d63a2
2013-07-02 14:07:19 -07:00
Deb Mukherjee
66324d501a Merge "Clean-up in forward update to use mapping tables" 2013-07-02 14:02:53 -07:00
Deb Mukherjee
8d3d2b76f3 Tx size selection enhancements
(1) Refines the modeling function and uses that to add some speed
features. Specifically, intead of using a flag use_largest_txfm as
a speed feature, an enum tx_size_search_method is used, of which
two of the types are USE_FULL_RD and USE_LARGESTALL. Two other
new types are added:
USE_LARGESTINTRA (use largest only for intra)
USE_LARGESTINTRA_MODELINTER (use largest for intra, and model for
inter)

(2) Another change is that the framework for deciding transform type
is simplified to use a heuristic count based method rather than
an rd based method using txfm_cache. In practice the new method
is found to work just as well - with derf only -0.01 down.
The new method is more compatible with the new framework where
certain rd costs are based on full rd and certain others are
based on modeled rd or are not computed. In this patch the existing
rd based method is still kept for use in the USE_FULL_RD mode.
In the other modes, the count based method is used.
However the recommendation is to remove it eventually since the
benefit is limited, and will remove a lot of complications in
the code

(3) Finally a bug is fixed with the existing use_largest_txfm speed feature
that causes mismatches when the lossless mode and 4x4 WH transform is
forced.

Results on derf:
USE_FULL_RD: +0.03% (due to change in the tables), 0% encode time reduction
USE_LARGESTINTRA: -0.21%, 15% encode time reduction (this one is a
pretty good compromise)
USE_LARGESTINTRA_MODELINTER: -0.98%, 22% encode time reduction
(currently the benefit of modeling is limited for txfm size selection,
but keeping this enum as a placeholder) .
USE_LARGESTALL: -1.05%, 27% encode-time reduction (same as existing
use_largest_txfm speed feature).

Change-Id: I4d60a5f9ce78fbc90cddf2f97ed91d8bc0d4f936
2013-07-02 13:54:00 -07:00
Deb Mukherjee
9c20cedd93 Clean-up in forward update to use mapping tables
Uses mapping tables instead of complicated modulo/division
operations for prob mapping for forward updates.

No bit-stream or output change.

Change-Id: Ifd9ce8ac1437835c305c94f64c18273c7a68f546
2013-07-02 12:48:20 -07:00
Ronald S. Bultje
3cc6eb7c00 Merge "Make get_coef_context() branchless." 2013-07-02 11:48:15 -07:00
Dmitry Kovalev
3140c443e4 Merge "Removing vp9_mbpitch.c, moving vp9_setup_block_dptrs to vp9_block.h." 2013-07-02 11:31:35 -07:00
Yunqing Wang
f4bee75c2b Merge "Add speed feature to disable splitmv" 2013-07-02 10:54:22 -07:00
Yunqing Wang
b12e060b55 Add speed feature to disable splitmv
Added a speed feature in speed 1 to disable splitmv for HD (>=720)
clips. Test result on stdhd set: 0.3% psnr loss and 0.07% ssim
loss. Encoding speedup is 36%.

(For reference: The test result on derf set showed 2% psnr loss
and 1.6% ssim loss. Encoding speedup is 34%. SPLITMV should be
enabled for small resolution videos.)

Change-Id: I54f72b94f506c6d404b47c42e71acaa5374d6ee6
2013-07-02 10:02:34 -07:00
Jingning Han
b91a1586a3 Calculate rd cost per transformed block
Compute the rate-distortion cost per transformed block, and cumulate
the cost through all blocks inside a partition. This allows encoder
to detect if the cumulative rd cost is already above the best rd cost,
thereby enabling early termination in the rate-distortion optimization
search.

Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
2013-07-02 09:58:46 -07:00
Ronald S. Bultje
9df24b41ca Merge "Update quantize SSSE3 SIMD to cover 32x32 transform case also." 2013-07-02 09:38:08 -07:00
Paul Wilkins
1319d9c077 Adjust Speed 0 settings.
Remove the use of sf->comp_inter_joint_search_thresh
from the baseline speed 0. Approx +0.4% on derf.

Change-Id: Icc14db98909830f40e5ac66130d40e78d2e55c71
2013-07-02 15:42:14 +01:00
Paul Wilkins
b7cd01ed73 Revert "New motion threshold factor - speed feature."
This reverts commit 1377278180.
Also fixes a spelling mistake.

Change-Id: I5be8aa4d8d3c0323d4a6f41968a7b2c048949c3f
2013-07-02 15:06:40 +01:00
Yaowu Xu
9e408e3504 fix the mismatch again in cpu_used 2
Change-Id: Icc4f70f0b0f91c9e7d5d00eedd67841afe2f2679
2013-07-01 19:13:18 -07:00
Jim Bankoski
d4158283e7 use partitioning from last frame
This cl converts use partition from last frame to do the following:

if part is none,horz, vert -> try split
if part != none and one of the children is not split - try none


Change-Id: I5b6c659e35f3ac9f11c051b92ba98af6d7e8aa87
Signed-off-by: Jim Bankoski <jimbankoski@google.com>
2013-07-01 18:18:50 -07:00
Dmitry Kovalev
1ac0540296 Removing vp9_mbpitch.c, moving vp9_setup_block_dptrs to vp9_block.h.
Change-Id: Ia547a5dd7650b771fd00edd673ab9f920270731c
2013-07-01 17:28:08 -07:00
Ronald S. Bultje
26b6318de8 Make get_coef_context() branchless.
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.

Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.

Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
2013-07-01 16:34:10 -07:00
Yaowu Xu
ba3b2604f0 Merge "Quantize (64-bit only, for now) SSSE3 SIMD." 2013-07-01 15:58:57 -07:00
Dmitry Kovalev
6411228aca Merge "Removing vp9_modecont.{h, c}." 2013-07-01 14:58:48 -07:00
Dmitry Kovalev
d9db0d96ec Merge "Moving encoder subexp encoding functions to subexp.{h, c}." 2013-07-01 14:58:36 -07:00
Ronald S. Bultje
c8defcfdee Update quantize SSSE3 SIMD to cover 32x32 transform case also.
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 2.3% overall speed increase.

Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
2013-07-01 11:36:33 -07:00
Ronald S. Bultje
7353ceab9d Quantize (64-bit only, for now) SSSE3 SIMD.
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.

Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
2013-07-01 11:36:07 -07:00
Dmitry Kovalev
2ab3bc8871 Removing vp9_modecont.{h, c}.
Moving vp9_default_inter_mode_probs array to vp9_entropymode.c.

Change-Id: I88ebda86ccc07f2a43c6c01d4b37898214cfb6de
2013-07-01 10:17:15 -07:00
Paul Wilkins
7bb436feee Merge "New motion threshold factor - speed feature." 2013-07-01 09:39:02 -07:00
Yaowu Xu
632289b31f fix a mismatch in cpuused 2
Change-Id: I921c9faba6386535aaf717a54301dd346a9b8540
2013-07-01 08:54:50 -07:00
Paul Wilkins
1377278180 New motion threshold factor - speed feature.
Added a speed feature that focuses only on thresholds
for new motion modes.

Moved sf->comp_inter_joint_search_thresh into speed
1.  This has ~+0.4% impact on quality at speed 0 as
our quality reference baseline.

Slight adjustment to baseline thresholds.

Change-Id: I7ebf104f1fe29af77ed4837b2e84be065621bbe5
2013-07-01 12:11:21 +01:00
Jingning Han
993942ce0c Merge "Enable SSE2 4x4 ADST/DCT transform" 2013-06-29 15:57:04 -07:00
Christian Duvivier
466e0cf303 SSE2 version of vp9_short_fdct32x32_rd.
43,000 -> 5,750 cycles, about 7.5x faster.

Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
2013-06-29 13:53:00 -07:00
Dmitry Kovalev
bb8ccf1caf Moving encoder subexp encoding functions to subexp.{h, c}.
Change-Id: I83ca53bf6def871f199a382a671f26ad7cbecbca
2013-06-29 11:50:45 -07:00
Ronald S. Bultje
bc70c60b25 Merge "fixed a bug where sse is not populated" 2013-06-29 07:42:41 -07:00
Ronald S. Bultje
a487af8d35 Merge "Inline vp9_get_coef_context() (and remove vp9_ prefix)." 2013-06-28 19:37:11 -07:00
Ronald S. Bultje
7731e53839 Merge "Minor change to prevent one level of dereference in cost_coeffs()." 2013-06-28 19:36:56 -07:00
Jingning Han
1109b6b888 Enable SSE2 4x4 ADST/DCT transform
This commit enables SSE2 4x4 foward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.

Change-Id: Iad4d526346e05c7be896466c05500711bb763660
2013-06-28 17:24:43 -07:00
Yaowu Xu
f853e662b7 fixed a bug where sse is not populated
Change-Id: I692d800af1f976c84a76f8bd66864c4b39540abc
2013-06-28 17:10:22 -07:00
Jingning Han
07b72ace70 Merge "Fix switch statement in 8x8 transform" 2013-06-28 16:49:59 -07:00
Dmitry Kovalev
59070f6e3c Merge "Removing CONFIG_DEBUG checks on assertions." 2013-06-28 14:03:28 -07:00
Jingning Han
9def7f72a0 Fix switch statement in 8x8 transform
Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
2013-06-28 13:40:36 -07:00
Ronald S. Bultje
cee3bc6ffa Merge "Some minor optimizations for cost_coeffs()." 2013-06-28 11:54:50 -07:00
Ronald S. Bultje
ec5d09b950 Merge "Make coefficient skip condition an explicit RD choice." 2013-06-28 11:54:28 -07:00
Ronald S. Bultje
d00b8e5f82 Inline vp9_get_coef_context() (and remove vp9_ prefix).
Makes cost_coeffs() a lot faster:
4x4: 236 -> 181 cycles
8x8: 888 -> 588 cycles
16x16: 3550 -> 2483 cycles
32x32: 17392 -> 12010 cycles

Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.

Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
2013-06-28 10:40:21 -07:00
Dmitry Kovalev
0345fc3ad9 Merge "Decoder's code cleanup." 2013-06-28 10:38:54 -07:00
Dmitry Kovalev
8e6ce6bb9e Removing CONFIG_DEBUG checks on assertions.
Adding CHECK_MEM_ERROR macro to vp9_common.h and removing two duplicated
ones from vp9_onyx_int.h and vp9_onyxd_int.h.

Change-Id: I916afec61b3019f18193135dac7c35ed0f89b8b6
2013-06-28 10:36:20 -07:00
Ronald S. Bultje
e3ce2b2ab3 Minor change to prevent one level of dereference in cost_coeffs().
4x4: 234 -> 236 cycles
8x8: 878 -> 888 cycles
16x16: 3664 -> 3550 cycles
32x32: 18134 -> 17392 cycles

Change-Id: I37a51bfbb0060a3a54f09c6045c14a989811ed78
2013-06-28 10:29:07 -07:00
Ronald S. Bultje
91d223bd5c Some minor optimizations for cost_coeffs().
Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
4x4: 298 -> 234 cycles
8x8: 1227 -> 878 cycles
16x16: 23426 -> 18134 cycles
32x32: 4906 -> 3664 cycles

Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.

Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
2013-06-28 10:29:02 -07:00
Ronald S. Bultje
af660715c0 Make coefficient skip condition an explicit RD choice.
This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.

The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.

Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.

Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
2013-06-28 10:28:49 -07:00
Yaowu Xu
1b5421f3c5 Merge "Minor cleanups" 2013-06-28 09:56:11 -07:00
Yaowu Xu
64bb996e03 Merge "Optimize partition search order" 2013-06-28 09:29:39 -07:00
Yaowu Xu
8b9eea0a34 Minor cleanups
Change-Id: I379617c1c731a686b3f7e032b8805860c1055b12
2013-06-28 09:19:50 -07:00
Yaowu Xu
1374a06bd8 Optimize partition search order
This commit change the partition search order to allow checking of
rectangular partition to be done after square partitions. It also
added a speed feature to skip rectangular partition check when
NONE is better than SPLIT in RD sense.

This feature roughly speed up encoder by 1.5X with loss on compression
-0.91% on cif set
-0.56% on stdhd set

Change-Id: I0d2d06993041aa9ea9073fcc39c54f73a127dfa4
2013-06-28 07:13:54 -07:00
Ronald S. Bultje
fd4eed3b08 Fix tile independence with both column tiling and static_thresh set.
Change-Id: I0b2be0ec2c410a527f88b95a44f24ac967b2dac1
2013-06-27 21:56:40 -07:00
Dmitry Kovalev
3231da0a9e Decoder's code cleanup.
Using vp9_set_pred_flag function instead of custom code, adding
decode_tokens function which is now called from decode_atom,
decode_sb_intra, and decode_sb.

Change-Id: Ie163a7106c0241099da9c5fe03069bd71f9d9ff8
2013-06-27 16:15:43 -07:00
Dmitry Kovalev
a3664258c5 Merge "General cleanup in segmentation-related code." 2013-06-27 14:57:07 -07:00
Dmitry Kovalev
be83ef3104 Merge "Moving subexp encoding functions in separate vp9_dsubexp.c file." 2013-06-27 14:55:18 -07:00
Ronald S. Bultje
7a049be6bf Inline quantize so idiv instruction gets removed from inner loop.
Encoding time of first 50 frames of bus @ 1500kbps (speed 0) goes from
3min15.0 to 3min10.9, i.e. 2.1% faster overall.

Change-Id: If592ee99be09bcd34a7c8498347f44e7305e982c
2013-06-27 09:01:04 -07:00
Paul Wilkins
05ffdf2625 Merge "Auto adapt step size feature." 2013-06-27 02:28:41 -07:00
Paul Wilkins
59af9049d3 Merge "Start adaptive threshold for each mode at max." 2013-06-27 02:28:36 -07:00
Paul Wilkins
5bcf069c6b Merge "Change meaning of cpi->sf.first_step and rename." 2013-06-27 02:28:21 -07:00
Jingning Han
fc1cfd8e32 Merge "Make intra predictor reference buffer configurable" 2013-06-26 19:02:02 -07:00
Jingning Han
861cb06c67 Make intra predictor reference buffer configurable
This commit enables configurable reference buffer pointer for intra
predictor. This allows later removal of spatial dependency between
blocks inside a 64x64 superblock in the rate-distortion optimization
loop.

Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
2013-06-26 17:17:21 -07:00
Jingning Han
a9e7243d1a Merge "Remove empty function vp9_build_block_offsets" 2013-06-26 17:06:56 -07:00
Ronald S. Bultje
b5468155b6 Remove unused macro RDTRUNC_8x8 from encodemb.c.
Change-Id: I0c097567adab24215d807963ccb34810a2afe007
2013-06-26 15:34:01 -07:00
Jingning Han
bd9bac0391 Remove empty function vp9_build_block_offsets
This function is empty, hence is removed.

Change-Id: Ia9d01710806bffe0398a6dc9405f8a5a81b27d74
2013-06-26 14:55:47 -07:00
Paul Wilkins
f36f66cd37 Merge "fixed a compiling problem with MSVC win32 build" 2013-06-26 11:58:16 -07:00
Paul Wilkins
9f3ab83486 Auto adapt step size feature.
Also tweaks to other features and experiments with
what is on and off at different speed settings.

Change-Id: I3e1d0be0d195216bf17c2ac5df67f34ce0b306b2
2013-06-26 19:48:39 +01:00
Dmitry Kovalev
be07485e9a General cleanup in segmentation-related code.
Using consistent function and variable names.

Change-Id: I2deb3fded8797453a2081836c9ce2e79ade06eb7
2013-06-26 10:27:28 -07:00
Dmitry Kovalev
49dee16879 Merge "Using get_plane_block_{width, height} instead of custom code." 2013-06-26 10:23:27 -07:00
Yaowu Xu
60dc7375af fixed a compiling problem with MSVC win32 build
The aligned array in parameter list caused win32 build to report
c2719 error. This commit fixed the issue by make the parameter
type a pointer instead of an array.

Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
2013-06-26 09:33:16 -07:00
Paul Wilkins
689957e3ad Start adaptive threshold for each mode at max.
Each frame we reset all adaptive thresholds to MAX
rather than base. As modes are picked their thresholds
drop down.

Change-Id: Ia37f03a73003c2d9bfcda57edea07205e9a0e5e8
2013-06-26 17:04:47 +01:00
Paul Wilkins
e606cac046 Change meaning of cpi->sf.first_step and rename.
Renamed cpi->sf.first_step to cpi->sf.reduce_first_step_size
and changed its meaning such that it is a delta applied to
reduce the default first step size (>> x) in the motion search
rather than an absolute value.

The default first step size is already changed according to the image
dimensions (smaller for smaller images). cpi->sf.reduce_first_step_size
now applies a further correction from the default.

Change-Id: Ia94e08bc24c67b604831f980909af7e982fcd16d
2013-06-26 17:04:06 +01:00
John Koleszar
8137e24f3d Merge "Move vp9_counts_to_nmv_context to encoder" 2013-06-25 22:44:21 -07:00
John Koleszar
7bbb0633cd Merge "Move vp9_full_to_model_counts to encoder" 2013-06-25 22:44:16 -07:00
Jingning Han
3cc8c8c3a0 Merge "Refactor intra predictor block" 2013-06-25 19:46:55 -07:00
Jingning Han
d19ea3861d Refactor intra predictor block
Remove vp9_intra4x4_predict(). Use the common intra prediction
function for all block sizes.

Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560
2013-06-25 16:33:13 -07:00
Dmitry Kovalev
6fb10f2de4 Renaming "nmv" to "mv".
Change-Id: I8299f55c3b930221e52c2237f2ddea65b94fd33b
2013-06-25 15:19:18 -07:00
Dmitry Kovalev
dc0f457c94 Using get_plane_block_{width, height} instead of custom code.
Change-Id: I453ed11b965e857a14c18ea5c0f4a0a48e7dc0d9
2013-06-25 14:11:18 -07:00
Ronald S. Bultje
0441e0a2fc Merge "Only do metrics on cropped (visible) area of picture." 2013-06-25 13:51:18 -07:00
Ronald S. Bultje
1d0ae2e63c Merge "Don't skip right/bottom border pixels in SSIM calculations." 2013-06-25 13:51:04 -07:00
Ronald S. Bultje
c5be54eef3 Merge "Add averaging-SAD functions for 8-point comp-inter motion search." 2013-06-25 13:50:53 -07:00
Jingning Han
d52c359d43 Merge "Tune the rounding operations in 8x8 ADST/DCT sse2" 2013-06-25 13:17:05 -07:00
Ronald S. Bultje
450c7b57a8 Only do metrics on cropped (visible) area of picture.
The part where we align it by 8 or 16 is an implementation detail that
shouldn't matter to the outside world.

Change-Id: I9edd6f08b51b31c839c0ea91f767640bccb08d53
2013-06-25 12:57:28 -07:00
Ronald S. Bultje
44f349df62 Don't skip right/bottom border pixels in SSIM calculations.
Change-Id: I75acb55ade54bef6ad7703ed5e691581fa2f8fe1
2013-06-25 12:57:28 -07:00
Ronald S. Bultje
c24d922396 Add averaging-SAD functions for 8-point comp-inter motion search.
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.

Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
2013-06-25 12:57:28 -07:00
Jingning Han
0084e61d5f Tune the rounding operations in 8x8 ADST/DCT sse2
Improve the round-trip precision to meet the unit test setttings.

Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
2013-06-25 12:02:26 -07:00
Ronald S. Bultje
5ebe47747d Merge "Don't re-allocate comp_pred buffers for each call to comp motion search." 2013-06-25 12:00:36 -07:00
Dmitry Kovalev
9467571777 Moving subexp encoding functions in separate vp9_dsubexp.c file.
Change-Id: Idbb2ea80f764fa830fe2ddcfc54ef7fe232f05a8
2013-06-25 11:53:17 -07:00
Dmitry Kovalev
5ae096778e Merge "Removing unused code." 2013-06-25 11:50:55 -07:00
Jingning Han
cd6932db77 Merge "Add 8x8 dct/adst unit tests" 2013-06-25 11:21:17 -07:00
Dmitry Kovalev
87ee34aacb Removing unused code.
Removing block index (ib) parameter from get_tx_type_{8x8, 16x16}
functions.

Change-Id: Ia213335aae7a7cb027f97b9cc9b04519840250f1
2013-06-25 10:17:19 -07:00
Dmitry Kovalev
70e9622185 Merge "Removing find_seg_id and using vp9_get_pred_mi_segid instead." 2013-06-25 10:16:06 -07:00
Dmitry Kovalev
529679bd52 Merge "Transforming scale_mv_component_q4 into scale_mv_q4 function." 2013-06-25 10:15:33 -07:00
Jingning Han
ab362621fe Add 8x8 dct/adst unit tests
This commit enables 8x8 DCT and hybrid transform unit tests. It
also tunes the forward hybrid transform rounding opertions for
more precise round-trip performance.

Change-Id: If05c1ce59d75d641b9c6c91527d02d3a6ef498c3
2013-06-25 09:57:01 -07:00
Jingning Han
67365520e7 Merge "Use aligned buffer operations in 8x8/16x16 2D-DCT" 2013-06-25 09:49:03 -07:00
Yaowu Xu
b9c934df8e Merge "Enable sse2 implmentation of 8x8 ADST/DCT" 2013-06-25 09:13:22 -07:00
Jingning Han
82d504b50f Use aligned buffer operations in 8x8/16x16 2D-DCT
This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.

Change-Id: I137758b81cd127b936175284310e81378db64552
2013-06-24 19:56:23 -07:00
Jingning Han
a32a086d23 Enable sse2 implmentation of 8x8 ADST/DCT
This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.

The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.

Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
2013-06-24 18:41:33 -07:00
Yaowu Xu
e371cd73a3 change to enable use_largest_txform feature
for all regular inter frames at speed 1

Change-Id: I0a8b301273ecf2b8730ab1f6b7a05f89f4d498e0
2013-06-24 16:43:26 -07:00
John Koleszar
4ecd6dbead Move vp9_counts_to_nmv_context to encoder
This function only used from within vp9_encodemv.c.

Change-Id: Ib3fc7c30b1e2d27321397ac474cbc8976bc1f4b1
2013-06-24 15:58:18 -07:00
John Koleszar
08b1798ae7 Move vp9_full_to_model_counts to encoder
This function is not called from the decoder, so it doesn't need to be
in common/.

Change-Id: I6977dd462a25b4ff39c9c7e1b0b5b16aa58ee733
2013-06-24 15:46:15 -07:00
Ronald S. Bultje
4dc70fa7f9 Don't re-allocate comp_pred buffers for each call to comp motion search.
Instead, just allocate a few bytes on the stack, this is 4k, which isn't
all that much.

Change-Id: I82af6ee89e6ed01faaa23ff891ee7ced76df8c16
2013-06-24 14:05:13 -07:00
Dmitry Kovalev
f27f76dfb3 Transforming scale_mv_component_q4 into scale_mv_q4 function.
Using MV instead of int_mv for function arguments.

Change-Id: Ic25e13dccbc98fac1fa1b3255127e00cca2a57f6
2013-06-21 15:34:29 -07:00
Ronald S. Bultje
fc033b38ee Remove emms - that shouldn't be there.
Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
2013-06-21 14:45:04 -07:00
Dmitry Kovalev
40141681c0 Removing find_seg_id and using vp9_get_pred_mi_segid instead.
Change-Id: Ia40229903c08f14020e90e94cfdf494aba1be827
2013-06-21 13:05:10 -07:00
Ronald S. Bultje
ba42c02654 Add missing SECTION .text marker in assembly file.
Fixes a crash on Windows when building with MSVC.

Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
2013-06-21 12:55:46 -07:00
Ronald S. Bultje
54b2a59623 Implement SSE2 block_error.
Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.

Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.

Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
2013-06-21 12:54:52 -07:00
Ronald S. Bultje
7756e9892b Merge "Add subtract_block SSE2 version and unit test." 2013-06-21 12:49:50 -07:00
Ronald S. Bultje
9a480482cb Merge "SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance()." 2013-06-21 12:49:43 -07:00
Ronald S. Bultje
25c588b1e4 Add subtract_block SSE2 version and unit test.
3% faster overall (3min35.0 to 3min28.5).

Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
2013-06-21 09:35:37 -07:00
Yaowu Xu
869d770610 Merge "Get some speed back for cpuused 1" 2013-06-20 22:37:01 -07:00
Yaowu Xu
45e25a7814 Get some speed back for cpuused 1
and remove unused code.

Change-Id: If380440c4450294b5450b7a9eeb94a376846ec01
2013-06-20 19:05:18 -07:00
Yaowu Xu
61721181ec Merge "rename variables to avoid build error in MSVC" 2013-06-20 19:04:30 -07:00
Yaowu Xu
ee07a261a0 rename variables to avoid build error in MSVC
Change-Id: I7960178c95c54d5c4497e44cfc8c493566294b34
2013-06-20 18:31:48 -07:00
Yaowu Xu
e6cd5ed307 Merge "Implement sse2 and ssse3 versions for all sub_pixel_variance sizes." 2013-06-20 17:42:50 -07:00
Ronald S. Bultje
1e6a32f1af SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance().
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.

Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
2013-06-20 15:59:48 -07:00
Dmitry Kovalev
8283d893eb Merge "Renaming 'nmv' to 'mv' for several functions." 2013-06-20 10:17:12 -07:00
Deb Mukherjee
7947a33d72 Improving model rd with variance and quant step
Improves the rd modeling function and implements them using interpolation
from a table which is a little faster. Also uses sse as input to the
modeling function rather than var - since there is no dc prediction
used and as a result the sse works a little better.

derfraw300: +0.05%
Speedup: ~1%

Change-Id: I151353c6451e0e8fe3ae18ab9842f8f67e5151ff
2013-06-20 10:06:28 -07:00
Jim Bankoski
9f2a1ae23e adds force partitioning greater than or less than block size
adds a new speed feature to force partitioning to be greater than
or less than a certain size

Change-Id: I8c048eeeef93700ae822eccf98f8751a45b2e7d0
2013-06-20 09:51:42 -07:00
Jim Bankoski
18bdf708e7 adds a set partitioning to speed features
this feature lets you set a partitioning size to be used by the entire
frame.

Change-Id: I208a4c8c701375cbb054418266f677768b6f8f06
2013-06-20 09:50:44 -07:00
Jim Bankoski
476d73d294 partition by variance using var from last frame
This uses variance to split partition. Variance is calculated using
nearest mv,  always from last ref frame.

Change-Id: Idd015b4a9aa3bc82591759eac239680c07496896
2013-06-20 09:48:22 -07:00
Jim Bankoski
1f94b97694 convert all speed things to speed features
Change-Id: Ie24489a4d39f3e53e816eeebf75a1c9c7d94515a
2013-06-20 09:42:44 -07:00
Jim Bankoski
727fa7b1e4 new partition via variance
Change-Id: Ideee45cad8b38087c509cd404484728e85d0c427
2013-06-20 09:42:05 -07:00
Jim Bankoski
0fad6a9d99 fix to set up new speed feature
This uses the speed feature functionality for code.

Change-Id: I9cd16c0c5f98520ae27ebba81aa2c178546587f8
2013-06-20 09:35:02 -07:00
Jim Bankoski
df2314cfdd don't copy partitions for key frames or altrefs
force us to go through slow partitioning for keyframes, altref and
overlays.

Change-Id: I1a286361bf74083e71973575a7296be46eb98742
2013-06-20 09:34:32 -07:00
Ronald S. Bultje
8fb6c58191 Implement sse2 and ssse3 versions for all sub_pixel_variance sizes.
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
3min58). Specific changes to timings for each function compared to
original assembly-optimized versions (or just new version timings if
no previous assembly-optimized version was available):

sse2   4x4:    99 ->   82 cycles
sse2   4x8:           128 cycles
sse2   8x4:           121 cycles
sse2   8x8:   149 ->  129 cycles
sse2   8x16:  235 ->  245 cycles (?)
sse2  16x8:   269 ->  203 cycles
sse2  16x16:  441 ->  349 cycles
sse2  16x32:          641 cycles
sse2  32x16:          643 cycles
sse2  32x32: 1733 -> 1154 cycles
sse2  32x64:         2247 cycles
sse2  64x32:         2323 cycles
sse2  64x64: 6984 -> 4442 cycles

ssse3  4x4:           100 cycles (?)
ssse3  4x8:           103 cycles
ssse3  8x4:            71 cycles
ssse3  8x8:           147 cycles
ssse3  8x16:          158 cycles
ssse3 16x8:   188 ->  162 cycles
ssse3 16x16:  316 ->  273 cycles
ssse3 16x32:          535 cycles
ssse3 32x16:          564 cycles
ssse3 32x32:          973 cycles
ssse3 32x64:         1930 cycles
ssse3 64x32:         1922 cycles
ssse3 64x64:         3760 cycles

Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
2013-06-20 09:34:25 -07:00
Jim Bankoski
f954490bbf disable speed > 1 speed corrections in firstpass
need to rework these

Change-Id: I17dc2c88d2faadd2f8fb117c52c25f04ea2e9856
2013-06-20 09:34:03 -07:00
Jim Bankoski
fbcce4dd6f Merge "copy partitioning from last fame" 2013-06-20 09:32:43 -07:00
Jim Bankoski
f033b44e74 copy partitioning from last fame
Change-Id: I26e80ede80cb4389378a95afa95d229092a9859a
2013-06-20 09:32:19 -07:00
Yunqing Wang
3656835771 Merge "Add two-pass quantization" 2013-06-19 11:35:40 -07:00
Yunqing Wang
b5bf7b13a8 Add two-pass quantization
Optimized the quantization function by making it a two-pass
process. The first pass does a quick checking of the transform
coefficients against the base ZBIN, and only keep the good
enough set of coefficients for quantization. A skipping
check is added. If all coefficients are within the base ZBIN, no
quantization is needed. The second pass is the actual quantization
pass, which only processes the coefficient subset determined
in first pass. This reduces the computation. Furthermore, an
alternitive method is used for large transform size, which often
has sparse nonzero quantized coefficients.

Overall, the encoder speedup is about 4%. The quantization function
itself gets 20% faster.

Change-Id: I3a9dd0da6db030260b6d9c314a9fa48ecae89f22
2013-06-19 10:35:02 -07:00
Yaowu Xu
12180c8329 Remove unnecessary copying of probs.
Change-Id: Ic924f07c6ab0c929c6cdf11880d3c625806e272c
2013-06-18 23:02:27 -07:00
Dmitry Kovalev
87e1fa7627 Renaming 'nmv' to 'mv' for several functions.
Change-Id: I183a38997a9d01e4a1b869e92509f6915216fa09
2013-06-18 18:28:10 -07:00
Jingning Han
7088426976 Merge "Make fdct32 computation flow within 16bit range" 2013-06-18 11:40:14 -07:00
Dmitry Kovalev
dfc0385291 Merge "Removing vp9_invtrans.{c, h} files." 2013-06-18 10:16:25 -07:00
Jingning Han
a41a4860c0 Make fdct32 computation flow within 16bit range
This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.

This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.

Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
2013-06-18 09:46:24 -07:00
Ronald S. Bultje
d9fc451666 Move subpixel variance function from common/ to encoder/.
This seems to only be used in the encoder. Also remove an empty wrapper
file that contained forward declarations for this function, but didn't
actually define any actual functions.

Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b
2013-06-17 16:54:09 -07:00
Dmitry Kovalev
686b99741c Removing vp9_invtrans.{c, h} files.
Moving single function from vp9_invtrans.c to vp9_encodemb.c.

Change-Id: I26bf6bb90de342a3036c0dbfba78a7dd75a61fe7
2013-06-17 16:09:03 -07:00
Ronald S. Bultje
a2f33e2505 Use assembly-optimized variance functions in sub_pixel_{avg}_var().
2.5% faster when encoding first 50 frames of bus @ 1500kbps.

Change-Id: I5a64703996cf7fd39b07e32c72311c4b125ec6d4
2013-06-17 14:57:13 -07:00
Ronald S. Bultje
53729c7786 Fix typo ('weight' instead of 'width').
Change-Id: I5d3944051d091b4bf3eb13e2a30132d34203ef74
2013-06-17 13:56:24 -07:00
John Koleszar
c2da365484 Merge "Remove constant vp9_coef_update_prob table" 2013-06-14 17:07:19 -07:00
John Koleszar
0f7a66e962 Remove constant vp9_coef_update_prob table
All elements of this table are equal to 252, so replace it with a
single constant VP9_COEF_UPDATE_PROB.

Change-Id: I1e2d1d284326ce6df9899a740c2fc344b3ec81c9
2013-06-14 15:12:31 -07:00
Jingning Han
0b7910b9ff Merge "Enable sse2 version of sad8x4/4x8" 2013-06-14 13:15:49 -07:00
Jingning Han
c43af9a8a3 Enable sse2 version of sad8x4/4x8
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-14 09:19:28 -07:00
Deb Mukherjee
4ad96115cd Some cleanups in rd motion search
No bitstream or output change - only cosmetics.

Change-Id: Ic8c1d7ad010a87dcf27d12a38cd7dd5adba683a7
2013-06-13 17:25:23 -07:00
Jingning Han
15f50e7b42 Enable sse2 version of sad8x4/4x8
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-13 16:18:18 -07:00
Ronald S. Bultje
fa96eeb835 Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d.
Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56
to 4min42.

Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f
2013-06-12 17:40:01 -04:00
Ronald S. Bultje
b55f8b696a Merge "Fix row tiling." 2013-06-12 12:41:57 -07:00
John Koleszar
ad3b12f857 Merge "Fix chroma output when scaling" 2013-06-12 12:39:10 -07:00