229 Commits

Author SHA1 Message Date
Johann
b23bd2360f The subfunctions are only defined for sse2
See highbd_subpel_variance_impl_sse2.asm

Change-Id: Id13b97f4f6d189ed71cdc6d52b3c4ea63dc1da05
2016-05-06 18:58:49 -07:00
Johann
a761197fbd Unlike non-hbd variance, opt2 is never used
Change-Id: I1d342725df332c4efc6006d9e3dcb7372c41f448
2016-05-06 18:38:04 -07:00
Geza Lore
a0e1c23277 Add SSE2 versions of 128x128 vpx_sad*
Encoder speedup with all experiments enabled approx 15%.

Change-Id: Ib3c771d8da00989ddc9112b71b48ce7c5594e91a
2016-05-06 14:18:00 +01:00
James Zern
2184692c07 vpx_dsp/*.[hc]: add missing vpx_dsp_rtcd.h include
Change-Id: I103be7eee36492f8619144ce8325bc916d4975c7
2016-05-04 15:06:44 -07:00
James Bankoski
89f905e5e5 Merge "libvpx: add a unit test for plane_add_noise." 2016-05-04 13:09:05 +00:00
Jim Bankoski
34d5aff747 libvpx: add a unit test for plane_add_noise.
In so doing this fixes a couple of bugs:

vpx_plane_add_noise.c needed to subtract a clamp instead of add.
And the assembly (mmx sse) had assumptions that parameters were
continuous in memory which was not true.

Change-Id: I76f2c43cf54bfc838eb2edf8a443eaaa7565d7b5
2016-05-03 16:23:06 -07:00
James Bankoski
e755a283dd Merge "Move vpx_add_plane from codec to vpx_dsp and dedup." 2016-05-03 14:11:57 +00:00
Jim Bankoski
fce3cee8dd Move vpx_add_plane from codec to vpx_dsp and dedup.
Change-Id: I12218d8331c0558c0587a66321e3ca46da7e5cc7
2016-05-02 12:17:39 -07:00
Yi Luo
299c5fc202 HBD hybrid transform 8x8 SSE4.1 optimization
- Tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Update bit-exact unit test against current C version.
- HBD encoder speed improves ~3.8%.

Change-Id: Ie13925ba11214eef2b5326814940638507bf68ec
2016-04-29 17:04:52 -07:00
Alex Converse
a68b24fdee Tweak casts on vpx_sub_pixel_variance to avoid implicit overflow.
Change-Id: I481eb271b082fa3497b0283f37d9b4d1f6de270c
2016-04-27 16:37:18 -07:00
Alex Converse
6c4007be1c Be explicit about overflow in vpx_variance16x16_sse2.
The product always fits in uint32_t, but the operands don't.

An optimizing compiler should generate the wraparound code.
(Verified with clang).

Change-Id: I25eb64df99152992bc898b8ccbb01d55c8d16e3c
2016-04-27 15:22:17 -07:00
Alex Converse
ccb894ce73 Remove casts on < 16x16 variance.
These blocks will never overflow since max sum is +/-255*w*h.

Change-Id: Ia2c630339fd9cfb411b56b6040ff402095f12a2e
2016-04-27 15:21:58 -07:00
Yaowu Xu
ed04e82a04 Merge branch 'master' into nextgenv2
Conflicts:
	vp10/common/scan.c
	vp9/common/vp9_pred_common.c
	vp9/decoder/vp9_decoder.c

Change-Id: Id559d98ea676da15d60ed464ddb6c48d3eed1111
2016-04-18 15:15:05 -07:00
Yi Luo
6db95602e4 Merge "Optimized HBD block subtraction for all block sizes" into nextgenv2 2016-04-12 21:22:32 +00:00
Yi Luo
0f80b1f754 Optimized HBD block subtraction for all block sizes
- Interface function takes a local MxN function to call based on the
  block size.
- Repetition call (w/o cache line miss) shows improvement:
  ~63% - ~340%.
- Overall encoder speed improvement: ~0.9%.

Change-Id: Ieff8f3d192415c61d6d58d8b99bb2a722004823f
2016-04-12 12:04:43 -07:00
Yi Luo
e5f4e8eab9 Some cosmetic improvements since HBD variance 4x4 optimization
Change-Id: I414c1fabd2e3a9b1d9daa8a90f85a0bace8bd3cd
2016-04-08 10:32:13 -07:00
James Zern
38bc1d0f4b vpx_fdct16x16_1_sse2: improve load pattern
load the full row rather than doing 2 8-wide columns

Change-Id: I7a1c0cba06b0dc1ae86046410922b1efccb95c95
2016-04-04 16:03:42 -07:00
James Zern
3735def667 vpx_fdctNxN_1_sse2: reduce store size
only output[0] needs to be set, store_output is more involved than a
movdqa in the high bitdepth case

Change-Id: I2cbd85d7cf74688bdf47eb767934fe42e02bff67
2016-04-04 16:02:06 -07:00
Yi Luo
250935cab3 Optimized HBD 4x4 variance calculation
vpx_highbd_8/10/12_variance4x4_sse4_1 improves performance ~7%-11%.

Change-Id: Ida22bb2a2f7a58037cfd73e186d4f6267a960c02
2016-04-04 11:28:59 -07:00
Geza Lore
552d5cd715 Extend superblock size fo 128x128 pixels.
If --enable-ext-partition is used at build time, the superblock size
(sometimes also referred to as coding unit (CU) size) is extended to
128x128 pixels.

Change-Id: Ie09cec6b7e8d765b7555ff5d80974aab60803f3a
2016-03-30 18:23:06 +01:00
Yaowu Xu
c810740c36 Merge branch 'masterbase' into nextgenv2
Conflicts:
	vp9/encoder/vp9_encoder.c
	vpx_dsp/x86/convolve.h

Change-Id: I60c3532936bedd796a75dfe78245a95ec21e2e55
2016-03-28 17:44:28 -07:00
Yunqing Wang
5f5552d846 Optimize HBD up-sampled prediction functions
Optimized 2 up-sampled reference prediction functions in high-bit
depth case. This reduced the HBD encoding time by 3%.

Change-Id: I8663ffb5234f5e70168c0fc9ca676309fe8e98f2
2016-03-14 19:04:33 -07:00
Yunqing Wang
e6e2d886d3 Add high-precision sub-pixel search as a speed feature
Using the up-sampled reference frames in sub-pixel motion search is
enabled as a speed feature for good-quality mode speed 0 and speed 1.

Change-Id: Ieb454bf8c646ddb99e87bd64c8e74dbd78d84a50
2016-03-11 16:32:11 -08:00
Scott LaVarnway
67c4c8244a VPX: loopfilter_mmx.asm using x86inc 2
This reverts commit 9aa083d164e0d39086aa0c83f0d1a0d0f0d1ba61.

Fixes a decoder mismatch with 32bit PIC builds.

Change-Id: I94717df662834810302fe3594b38c53084a4e284
2016-03-08 04:24:47 -08:00
Geza Lore
938b8dfc73 Extend convolution functions to 128x128 for ext-partition.
Change-Id: I7f7e26cd1d58eb38417200550c6fbf4108c9f942
2016-03-07 11:39:27 +00:00
James Zern
9aa083d164 Revert "VPX: loopfilter_mmx.asm using x86inc"
This reverts commit 15ecdc3970462c15fdf7185d373cb52664f40c0f.

breaks 32-bit pic builds

Change-Id: I8bb1b9471a293f05ac7423aaba0339d408931b7a
2016-03-04 18:23:45 -08:00
Geza Lore
697bf5beff Add 128 pixel variance and SAD functions
Change-Id: I8fde245b32c9e586683a28aa6925da0b83850b39
2016-03-03 10:24:29 +00:00
Debargha Mukherjee
1d69ceee5c Adds masked variance and sad functions for wedge
Adds masked variance and sad functions needed for wedge
prediction modes to come.

Change-Id: I25b231bbc345e6a494316abb0a7d5cd5586a3a54
2016-03-01 17:28:56 -08:00
Yunqing Wang
342a368fd4 Do sub-pixel motion search in up-sampled reference frames
Up-sampled the reference frames to 8 times in each dimension using
the 8-tap interpolation filter. In sub-pixel motion search, use the
up-sampled reference frames to find the best matching blocks. This
largely improved the motion search precision, and thus, improved
the compression quality. There was no change in decoder side.

Borg test and speed test results:
1. On derflr set,
Overall PSNR gain: 1.306%, and SSIM gain: 1.512%.
Average speed loss on derf set was 6.0%.
2. On stdhd set,
Overall PSNR gain: 0.754%, and SSIM gain: 0.814%.
On hevchd set,
Overall PSNR gain: 0.465%, and SSIM gain: 0.527%.
Speed loss on HD clips was 3.5%.

Change-Id: I300ebaafff57e88914f3dedc8784cb21d316b04f
2016-02-29 12:14:47 -08:00
Scott LaVarnway
dd6729f826 VPX: Remove pmin/pmax from subpixel functions.
These instructions are unnecessary if the adds
are done in the correct order.

Change-Id: I4e533b8267c32e610a4b94203ad052dc9fdabd71
2016-02-27 05:47:56 -08:00
Scott LaVarnway
51beb29f52 Merge "VPX: vpx_filter_block1d16_(v8, v8_avg)" 2016-02-27 13:31:18 +00:00
James Zern
654d2163c9 x86/convolve.h: remove redundant check in FUN_CONV_2D
the filter will be the same in this case

Change-Id: I95159bcb05bbfb71b57da741393e80cc7ffc5cff
2016-02-25 23:31:50 -08:00
James Zern
6d8c8c6201 x86/convolve.h: replace while w/if for w < 16
in non-hbd configurations; any high-bitdepth changes will be done in a
follow-up

Change-Id: Ia74e30971b744c1faab68c92fdeda1a053988c77
2016-02-25 21:44:06 -08:00
Scott LaVarnway
1f736e400f VPX: vpx_filter_block1d16_(v8, v8_avg)
Store result with one 16 byte store instead of
two 8 byte stores.

Change-Id: I43acbc5edfd6d6055a926f9b9605d47127400f09
2016-02-25 06:15:24 -08:00
James Zern
b3ceb629ba x86/convolve.h: change filter[] || chains to |
Change-Id: I661f64390f232826857b259e7a67e77f5a3a91ad
2016-02-24 19:47:43 -08:00
Yaowu Xu
aa6c754635 Merge remote-tracking branch 'webm/master' into nextgenv2 2016-02-24 10:53:17 -08:00
Scott LaVarnway
06d0e2fe6c BUG FIX: vpx_filter_block1d(8,4)_(v8, v8_avg)
Change-Id: Ic7ea79988ed0864e7ddbfeb312516bcf77eaaac1
2016-02-23 12:23:41 -08:00
Scott LaVarnway
15ecdc3970 VPX: loopfilter_mmx.asm using x86inc
Change-Id: Idcf29281d617b275e3ca50f77e6d00c60992a36d
2016-02-18 15:34:58 -08:00
James Zern
9b44d9d00f split vpx_highbd_lpf_horizontal_16 in two
replace with vpx_highbd_lpf_horizontal_edge_16 and
vpx_highbd_lpf_horizontal_edge_8 to avoid passing a count parameter

Change-Id: I551f8cec0fce57032cb2652584bb802e2248644d
2016-02-16 23:13:58 -08:00
James Zern
1b519fb666 split vpx_lpf_horizontal_16 in two
replace with vpx_lpf_horizontal_edge_16 and vpx_lpf_horizontal_edge_8 to
avoid passing a count parameter

Change-Id: I848c95c02a3c6ebaa6c2bdf0983dce05cd645271
2016-02-16 22:57:45 -08:00
James Zern
e7a23d703b vpx_highbd_lpf_horizontal_4: remove unused count param
Change-Id: I655a771e1b1a8753be5669ef9348a312ba6cfdbc
2016-02-16 22:57:45 -08:00
James Zern
5171857329 vpx_highbd_lpf_horizontal_8: remove unused count param
Change-Id: Iaca71ea3796115d4c2d43563b4e6f3914e21f1bf
2016-02-16 22:57:44 -08:00
James Zern
3c1019e49d vpx_highbd_lpf_vertical_4: remove unused count param
Change-Id: Ic6da723c5cf3cd8127db1f476c3e46ea134cb774
2016-02-16 22:57:44 -08:00
James Zern
72a9f06ac2 vpx_highbd_lpf_vertical_8: remove unused count param
Change-Id: Id16f7259897654831d31642c2d5e0bbe5e13416c
2016-02-16 22:57:44 -08:00
James Zern
b1e97c6a25 vpx_lpf_horizontal_4: remove unused count param
Change-Id: Iec7d8eda343991f7d7d46931dca17af23c821d11
2016-02-16 22:57:27 -08:00
James Zern
bd5a5bb561 vpx_lpf_horizontal_8: remove unused count param
Change-Id: I48741e167a7b09b7c9ad3bfc1c4b88ef1029ae46
2016-02-16 22:54:40 -08:00
James Zern
109a47b342 vpx_lpf_vertical_4: remove unused count param
Change-Id: I43a191cb3d42e51e7bca266adfa11c6239a8064c
2016-02-16 14:59:00 -08:00
James Zern
37225744db vpx_lpf_vertical_8: remove unused count param
Change-Id: Ic69406da00afb0f06588e8c0deb2b043952b078c
2016-02-16 14:59:00 -08:00
Geza Lore
abd00505d1 Add optimized vpx_sum_squares_2d_i16 for vp10.
Using this we can eliminate large numbers of calls to predict intra,
and is also faster than most of the variance functions it replaces.
This is an equivalence transform so coding performance is unaffected.

Encoder speedup is approx 7% when var_tx, super_tx and ext_tx are all
enabled.

Change-Id: I0d4c83afc4a97a1826f3abd864bd68e41bb504fb
2016-02-15 16:54:52 +00:00
Yaowu Xu
0aef1bc898 Enable sse2 version of inverse wht for hbd build
Change-Id: If8f5efd701a11c8a7ad3078d10ec3cd0fe27667e
2016-01-29 14:47:56 -08:00