Commit Graph

304 Commits

Author SHA1 Message Date
Alex Converse
6c4007be1c Be explicit about overflow in vpx_variance16x16_sse2.
The product always fits in uint32_t, but the operands don't.

An optimizing compiler should generate the wraparound code.
(Verified with clang).

Change-Id: I25eb64df99152992bc898b8ccbb01d55c8d16e3c
2016-04-27 15:22:17 -07:00
Alex Converse
ccb894ce73 Remove casts on < 16x16 variance.
These blocks will never overflow since max sum is +/-255*w*h.

Change-Id: Ia2c630339fd9cfb411b56b6040ff402095f12a2e
2016-04-27 15:21:58 -07:00
James Zern
38bc1d0f4b vpx_fdct16x16_1_sse2: improve load pattern
load the full row rather than doing 2 8-wide columns

Change-Id: I7a1c0cba06b0dc1ae86046410922b1efccb95c95
2016-04-04 16:03:42 -07:00
James Zern
3735def667 vpx_fdctNxN_1_sse2: reduce store size
only output[0] needs to be set, store_output is more involved than a
movdqa in the high bitdepth case

Change-Id: I2cbd85d7cf74688bdf47eb767934fe42e02bff67
2016-04-04 16:02:06 -07:00
Scott LaVarnway
67c4c8244a VPX: loopfilter_mmx.asm using x86inc 2
This reverts commit 9aa083d164.

Fixes a decoder mismatch with 32bit PIC builds.

Change-Id: I94717df662834810302fe3594b38c53084a4e284
2016-03-08 04:24:47 -08:00
James Zern
9aa083d164 Revert "VPX: loopfilter_mmx.asm using x86inc"
This reverts commit 15ecdc3970.

breaks 32-bit pic builds

Change-Id: I8bb1b9471a293f05ac7423aaba0339d408931b7a
2016-03-04 18:23:45 -08:00
Scott LaVarnway
dd6729f826 VPX: Remove pmin/pmax from subpixel functions.
These instructions are unnecessary if the adds
are done in the correct order.

Change-Id: I4e533b8267c32e610a4b94203ad052dc9fdabd71
2016-02-27 05:47:56 -08:00
Scott LaVarnway
51beb29f52 Merge "VPX: vpx_filter_block1d16_(v8, v8_avg)" 2016-02-27 13:31:18 +00:00
James Zern
654d2163c9 x86/convolve.h: remove redundant check in FUN_CONV_2D
the filter will be the same in this case

Change-Id: I95159bcb05bbfb71b57da741393e80cc7ffc5cff
2016-02-25 23:31:50 -08:00
James Zern
6d8c8c6201 x86/convolve.h: replace while w/if for w < 16
in non-hbd configurations; any high-bitdepth changes will be done in a
follow-up

Change-Id: Ia74e30971b744c1faab68c92fdeda1a053988c77
2016-02-25 21:44:06 -08:00
Scott LaVarnway
1f736e400f VPX: vpx_filter_block1d16_(v8, v8_avg)
Store result with one 16 byte store instead of
two 8 byte stores.

Change-Id: I43acbc5edfd6d6055a926f9b9605d47127400f09
2016-02-25 06:15:24 -08:00
James Zern
b3ceb629ba x86/convolve.h: change filter[] || chains to |
Change-Id: I661f64390f232826857b259e7a67e77f5a3a91ad
2016-02-24 19:47:43 -08:00
Scott LaVarnway
06d0e2fe6c BUG FIX: vpx_filter_block1d(8,4)_(v8, v8_avg)
Change-Id: Ic7ea79988ed0864e7ddbfeb312516bcf77eaaac1
2016-02-23 12:23:41 -08:00
Scott LaVarnway
15ecdc3970 VPX: loopfilter_mmx.asm using x86inc
Change-Id: Idcf29281d617b275e3ca50f77e6d00c60992a36d
2016-02-18 15:34:58 -08:00
James Zern
9b44d9d00f split vpx_highbd_lpf_horizontal_16 in two
replace with vpx_highbd_lpf_horizontal_edge_16 and
vpx_highbd_lpf_horizontal_edge_8 to avoid passing a count parameter

Change-Id: I551f8cec0fce57032cb2652584bb802e2248644d
2016-02-16 23:13:58 -08:00
James Zern
1b519fb666 split vpx_lpf_horizontal_16 in two
replace with vpx_lpf_horizontal_edge_16 and vpx_lpf_horizontal_edge_8 to
avoid passing a count parameter

Change-Id: I848c95c02a3c6ebaa6c2bdf0983dce05cd645271
2016-02-16 22:57:45 -08:00
James Zern
e7a23d703b vpx_highbd_lpf_horizontal_4: remove unused count param
Change-Id: I655a771e1b1a8753be5669ef9348a312ba6cfdbc
2016-02-16 22:57:45 -08:00
James Zern
5171857329 vpx_highbd_lpf_horizontal_8: remove unused count param
Change-Id: Iaca71ea3796115d4c2d43563b4e6f3914e21f1bf
2016-02-16 22:57:44 -08:00
James Zern
3c1019e49d vpx_highbd_lpf_vertical_4: remove unused count param
Change-Id: Ic6da723c5cf3cd8127db1f476c3e46ea134cb774
2016-02-16 22:57:44 -08:00
James Zern
72a9f06ac2 vpx_highbd_lpf_vertical_8: remove unused count param
Change-Id: Id16f7259897654831d31642c2d5e0bbe5e13416c
2016-02-16 22:57:44 -08:00
James Zern
b1e97c6a25 vpx_lpf_horizontal_4: remove unused count param
Change-Id: Iec7d8eda343991f7d7d46931dca17af23c821d11
2016-02-16 22:57:27 -08:00
James Zern
bd5a5bb561 vpx_lpf_horizontal_8: remove unused count param
Change-Id: I48741e167a7b09b7c9ad3bfc1c4b88ef1029ae46
2016-02-16 22:54:40 -08:00
James Zern
109a47b342 vpx_lpf_vertical_4: remove unused count param
Change-Id: I43a191cb3d42e51e7bca266adfa11c6239a8064c
2016-02-16 14:59:00 -08:00
James Zern
37225744db vpx_lpf_vertical_8: remove unused count param
Change-Id: Ic69406da00afb0f06588e8c0deb2b043952b078c
2016-02-16 14:59:00 -08:00
Yaowu Xu
0aef1bc898 Enable sse2 version of inverse wht for hbd build
Change-Id: If8f5efd701a11c8a7ad3078d10ec3cd0fe27667e
2016-01-29 14:47:56 -08:00
Yaowu Xu
b229710811 SSSE3 idct8x8 functions for highbitdpeth build
This commit changes SSSE3 optimized idct8x8 functions to work with
highbitdepth build.

With this commit and the previous one that enabled SSSE3 idct32x32
functions, tests showed virtually no difference on decoding speed for
file fdJc1_IBKJA.248.webm for the build with -enable-vp9-highbitdpeth
option and the build without the option.

Change-Id: Ibe0634149ec70e8b921e6b30171664b8690a9c45
2016-01-29 12:36:53 -08:00
Yaowu Xu
aac1ef7f80 Enable hbd_build to use SSSE3optimized functions
This commit changes the SSSE3 assembly functions for idct32x32 to
support highbitdepth build.

On test clip fdJc1_IBKJA.248.webm, this cuts the speed difference
between hbd and lbd build from between 3-4% to 1-2%.

Change-Id: Ic3390e0113bc1ca5bba8ec80d1795ad31b484fca
2016-01-29 01:30:43 +00:00
James Zern
3a2ad10de2 Merge "Code clean of sad4xNx4D_sse" 2016-01-25 20:57:15 +00:00
Alex Converse
ed3df445d9 Revert "Merge "Change highbd variance rounding to prevent negative variance.""
This reverts commit ea48370a50, reversing
changes made to 15939cb2d7.

The commit was insufficiently tested and causes failures.

Change-Id: I623d6fc2cd3ae6fd42d0abab1f8eada465ae57a7
2016-01-13 11:19:06 -08:00
Alex Converse
ea48370a50 Merge "Change highbd variance rounding to prevent negative variance." 2016-01-13 00:25:54 +00:00
Jian Zhou
26a6ce4c6d Code clean of highbd_tm_predictor_32x32
Remove the ARCH_X86_64 constraint. No performance hit on both
big core and small core.

Change-Id: I39860b62b7a0ae4acaafdca7d68f3e5820133a81
2015-12-22 16:51:57 -08:00
Jian Zhou
355bfa2193 Code clean of highbd_tm_predictor_16x16
Remove the ARCH_X86_64 constraint.

Change-Id: I0139f8e998cc5525df55161c2054008d21ac24d4
2015-12-22 16:34:40 -08:00
Jian Zhou
a4c265f1b7 Code clean of highbd_dc_predictor_32x32
Remove the ARCH_X86_64 constraint.

Change-Id: I7d2545fc4f24eb352cf3e03082fc4d48d46fbb09
2015-12-22 16:06:54 -08:00
James Zern
cedb1db594 Merge "Code clean of highbd_tm_predictor_4x4" 2015-12-22 16:45:01 +00:00
James Zern
a097963f80 Merge "Code clean of highbd_dc_predictor_4x4" 2015-12-22 16:30:37 +00:00
Jian Zhou
52e7f4153b Merge "Code clean of highbd_v_predictor_4x4" 2015-12-21 18:07:48 +00:00
Yunqing Wang
b597e3e188 Merge "Fix for issue 1114 compile error" 2015-12-19 04:29:39 +00:00
James Zern
8b2ddbc728 sad_sse2: fix sad4xN(_avg) on windows
reduce the register count by 1 to avoid xmm6 and unnecessarily
penalizing the other users of the base macro

Change-Id: I59605c9a41a31c1b74f67ec06a40d1a7f92c4699
2015-12-18 19:19:32 -08:00
Jian Zhou
db11307502 Code clean of highbd_tm_predictor_4x4
Replace MMX with SSE2, reduce mem access to left neighbor,
loop unrolled.

Change-Id: I941be915af809025f121ecc6c6443f73c9903e70
2015-12-18 18:43:41 -08:00
Jian Zhou
c91dd55eda Code clean of highbd_v_predictor_4x4
MMX replaced with SSE2, same performance.

Change-Id: I2ab8f30a71e5fadbbc172fb385093dec1e11a696
2015-12-18 15:25:27 -08:00
Jian Zhou
8366b414dd Code clean of highbd_dc_predictor_4x4
MMX replaced with SSE2, same performance.

Change-Id: Ic57855254e26757191933c948fac6aa047fadafc
2015-12-18 12:45:23 -08:00
Peter de Rivaz
7361ef732b Fix for issue 1114 compile error
In 32-bit build with --enable-shared, there is a lot of
register pressure and register src_strideq is reused.
The code needs to use the stack based version of src_stride,
but this doesn't compile when used in an lea instruction.

This patch also fixes a related segmentation fault caused by the
implementation using src_strideq even though it has been
reused.

This patch also fixes the HBD subpel variance tests that fail
when compiled without disable-optimizations.
These failures were caused by local variables in the assembler
routines colliding with the caller's stack frame.

Change-Id: Ice9d4dafdcbdc6038ad5ee7c1c09a8f06deca362
2015-12-18 09:43:22 +00:00
Jian Zhou
789dbb3131 Code clean of sad4xNx4D_sse
Replace MMX with SSE2.

Change-Id: I948ca1be6ed9b8e67f16555e226f1203726b7da6
2015-12-17 17:43:46 -08:00
Jian Zhou
b158d9a649 Code clean of sad4xN(_avg)_sse
Replace MMX with SSE2, reduce psadbw ops which may help Silvermont.

Change-Id: Ic7aec15245c9e5b2f3903dc7631f38e60be7c93d
2015-12-17 11:10:42 -08:00
James Zern
b81f04a0cc Merge "move vp9_avg to vpx_dsp" 2015-12-15 03:41:22 +00:00
James Zern
d36659cec7 move vp9_avg to vpx_dsp
Change-Id: I7bc991abea383db1f86c1bb0f2e849837b54d90f
2015-12-14 14:42:12 -08:00
Jian Zhou
2404e3290e Merge "Code clean of tm_predictor_32x32" 2015-12-14 17:56:01 +00:00
Jian Zhou
6e87880e7f Merge "Speed up tm_predictor_16x16" 2015-12-11 18:55:46 +00:00
Jian Zhou
88120481a4 Code clean of tm_predictor_32x32
Reallocate the xmm register usage so that no ARCH_X86_64 required.
Reduce memory access to the left neighbor by half.
Speed up by single digit on big core machine.

Change-Id: I392515ed8e8aeb02e6a717b3966b1ba13f5be990
2015-12-11 10:32:08 -08:00
Jian Zhou
62f986265f Merge "SSE2 based h_predictor_32x32" 2015-12-11 18:02:34 +00:00
James Zern
ecb8dff768 Merge "dc_left_pred[48]: fix pic builds" 2015-12-11 02:48:11 +00:00
Jian Zhou
5604924945 Merge "Code clean of dc_left/top_predictor_16x16" 2015-12-11 01:53:44 +00:00
James Zern
40ee78bc19 dc_left_pred[48]: fix pic builds
GET_GOT modifies the stack pointer so the offset for left's address will
be wrong if loaded afterword.

Change-Id: Iff9433aec45f5f6fe1a59ed8080c589bad429536
2015-12-10 15:44:31 -08:00
Yunqing Wang
322ea7ff5b Fix the win32 crash when GET_GOT is not defined
This patch continues to fix the win32 crash issue:
https://bugs.chromium.org/p/webm/issues/detail?id=1105

Johann's patch is here:
https://chromium-review.googlesource.com/#/c/316446/2

Change-Id: I7fe191c717e40df8602e229371321efb0d689375
2015-12-10 14:25:01 -08:00
Jian Zhou
4ec5953080 Code clean of dc_left/top_predictor_16x16
Remove some redundant code.

Change-Id: Ida2e8c0ce28770f7a9545ca014fe792b04295260
2015-12-10 11:59:58 -08:00
Jian Zhou
c90a8a1a43 SSE2 based h_predictor_32x32
Relocate the function from SSSE3 to SSE2, Unroll loop from 16 to 8,
and reduce mem access to left.
Speed up by single digit in ./test_intra_pred_speed on big core
machines.

Change-Id: I2b7fc95ffc0c42145be2baca4dc77116dff1c960
2015-12-10 10:09:58 -08:00
Johann Koenig
420b9f5bd3 Merge "fix null pointer crash in Win32 because esp register is broken" 2015-12-09 19:31:12 +00:00
Jian Zhou
aa5b517a39 Re-enable SSE2 based intra 4x4 prediction
4x4 Intra predictor implemented with MMX is replaced with SSE2.
Segfault in change 315561 when decoding vp8 is taken care of.

Change-Id: I083a7cb4eb8982954c20865160f91ebec777ec76
2015-12-07 18:50:37 -08:00
Scott LaVarnway
c7e557b82c Merge "VP9: Add ssse3 version of vpx_idct32x32_135_add()" 2015-12-07 21:13:35 +00:00
Sergey Kolomenkin
5fc9688792 fix null pointer crash in Win32 because esp register is broken
https://bugs.chromium.org/p/webm/issues/detail?id=1105

Change-Id: I304ea85ea1f6474e26f074dc39dc0748b90d4d3d
2015-12-07 12:57:06 -08:00
James Zern
79a9add666 Revert "MMX in intra 4x4 prediction replaced with SSE2"
This reverts commit 89a1efa4c4.

This causes a segfault when decoding vp8, in both 32 and 64-bit

Change-Id: Idbb9bb28ab897e1d055340497c47b49a12231367
2015-12-05 10:20:39 -08:00
Jian Zhou
e86c7c863e Speed up h_predictor_16x16
Relocate the function from SSSE3 to SSE2, Unroll loop from 8 to 4,
and reduce mem access to left.
Speed up by >20% in ./test_intra_pred_speed.

Change-Id: Ie48229c2e32404706b722442942c84983bda74cc
2015-12-04 12:12:55 -08:00
Jian Zhou
da3f08fac3 Speed up h_predictor_8x8
Relocate the function from SSSE3 to SSE2, Unroll loop from 4 to 2,
and reduce mem access to left.
Speed up by >20% in ./test_intra_pred_speed.

Change-Id: Ib9f1846819783b6e05e2a310c930eb844b2b4d2e
2015-12-04 11:36:44 -08:00
Jian Zhou
aa2764abdd MMX in intra 8x8 prediction replaced with SSE2
8x8 Intra predictor implemented with MMX is replaced with SSE2.

Change-Id: I0c90e7c1e1e6942489ac2bfe58903b728aac7a52
2015-12-03 18:11:06 -08:00
Jian Zhou
89a1efa4c4 MMX in intra 4x4 prediction replaced with SSE2
4x4 Intra predictor implemented with MMX is replaced with SSE2.

Change-Id: Id57da2a7c38832d0356bc998790fc1989d39eafc
2015-12-03 16:40:23 -08:00
Jian Zhou
623e988add Merge "SSE2 speed up of h_predictor_4x4" 2015-12-02 18:49:00 +00:00
Scott LaVarnway
f0b0b1fe62 VP9: Add ssse3 version of vpx_idct32x32_135_add()
Change-Id: I9a780131efaad28cf1ad233ae64c5c319a329727
2015-12-02 04:50:46 -08:00
Jian Zhou
c7fae5d893 Speed up tm_predictor_16x16
Reduce mem access to left. Speed up by 10% in ./test_intra_pred_speed
with the same instruction size.

Change-Id: Ia33689d62476972cc82ebb06b50415aeccc95d15
2015-11-30 17:46:40 -08:00
Scott LaVarnway
2669e05949 Merge "VPX: x86 asm version of vpx_idct32x32_1024_add()" 2015-11-30 23:28:27 +00:00
Jian Zhou
9d29d76280 SSE2 speed up of h_predictor_4x4
Relocate h_predictor_4x4 from SSSE3 to SSE2 with XMM registers.
Speed up by ~25% in ./test_intra_pred_speed.

Change-Id: I64e14c13b482a471449be3559bfb0da45cf88d9d
2015-11-30 10:08:05 -08:00
Scott LaVarnway
0148e20c3c VPX: x86 asm version of vpx_idct32x32_1024_add()
Change-Id: I3ba4ede553e068bf116dce59d1317347988b3542
2015-11-25 10:11:29 -08:00
Jian Zhou
901d20369a Merge "Speed up tm_predictor_8x8" 2015-11-25 02:34:07 +00:00
Alex Converse
022c848b4d Change highbd variance rounding to prevent negative variance.
Always round sum error and sum square error toward zero in variance
calculations. This prevents variance from becoming negative.
Avoiding rounding variance at all might be better but would be far
more invasive.

Change-Id: Icf24e0e75ff94952fc026ba6a4d26adf8d373f1c
2015-11-24 16:32:01 -08:00
Jian Zhou
f4621c5c8d Speed up tm_predictor_8x8
Left neighbor read from memory only once.
Speed up by ~20% in ./test_intra_pred_speed.

Change-Id: Ia1388630df6fed0dce9a6eeded6cb855bbc43505
2015-11-24 16:07:06 -08:00
Scott LaVarnway
97e6cc6198 VPX: Removed unnecessary pmulhrsw in IDCT32X32_34
and fixed macro name.

Change-Id: I306b98a2b4ec80b130ae80290b4cd9c7a5363311
2015-11-23 10:24:09 -08:00
James Zern
16eba81f69 Revert "Speed up h_predictor_4x4"
This reverts commit d76032ae87.

breaks 32-bit builds

Change-Id: If6266ec2a405b5a21d615112f0f37e8a71193858
2015-11-20 22:25:29 -08:00
James Zern
1b10753ad7 Merge "Speed up h_predictor_4x4" 2015-11-21 01:12:42 +00:00
Scott LaVarnway
e7fc39fdf5 Merge "VPX: x86 asm version of vpx_idct32x32_34_add()" 2015-11-20 15:11:00 +00:00
Jian Zhou
d76032ae87 Speed up h_predictor_4x4
Modify h_predictor_4x4 with XMM registers.
Speed up by ~25% in ./test_intra_pred_speed.

Change-Id: Id01c34c48e75b9d56dfc2e93af12cf0c0326a279
2015-11-19 11:34:22 -08:00
Jian Zhou
79b68626ae Speed up tm_predictor_4x4
tm_predictor_4x4 is implemented with SSE2 using XMM registers.
Speed up by ~25% in ./test_intra_pred_speed.

Change-Id: I25074b78d476a2cb17f81cf654bdfd80df2070e0
2015-11-18 16:44:25 -08:00
Scott LaVarnway
ed833048c2 VPX: x86 asm version of vpx_idct32x32_34_add()
Change-Id: Ic81f38998fb1b8d33f5a5d7424c2c41002786cef
2015-11-17 17:42:24 -08:00
James Zern
0ccad4d649 Revert "VPX: x86 asm version of vpx_idct32x32_34_add()"
This reverts commit 9aeaa2016e.

This causes some test vectors to fail.

Change-Id: I3659a2068404ec5a0591fba5c88b1bec0c9059a4
2015-11-11 11:12:38 -08:00
James Zern
e3efed7f4c Merge "convolve_copy_sse2: replace SSE w/SSE2 code" 2015-11-10 22:35:12 +00:00
Scott LaVarnway
f48321974b Merge "VPX: x86 asm version of vpx_idct32x32_34_add()" 2015-11-10 21:40:11 +00:00
Scott LaVarnway
9aeaa2016e VPX: x86 asm version of vpx_idct32x32_34_add()
Change-Id: I8a933c63b7fbf3c65e2c06dbdca9646cadd0b7cb
2015-11-10 11:54:56 -08:00
James Zern
40dab58941 convolve_copy_sse2: replace SSE w/SSE2 code
this should be neutral or slightly faster on modern (P4+) architectures

Change-Id: Iec4c080275941eb8c9e05a66a2daf0405d86a69b
2015-11-09 23:45:16 -08:00
Geza Lore
9cfba09ac0 Optimize vpx_quantize_{b,b_32x32} assembler.
Added optimization of the 8 bit assembly quantizer routines. This makes
these functions up to 100% faster, depending on encoding parameters.

This patch maskes the encoder faster in both the high bitdepth and 8bit
configurations. In the high bitdepth configuration, it effects profile 0
only.

Based on my profiling using 1080p input the net gain is between 1-3% for
the 8 bit config, and around 2.5-4.5% for the high bitdepth config,
depending on target bitrate. The difference between the 8 bit and high
bitdepth configurations for the same encoder run is reduced by 1% in all
cases I have profiled.

Change-Id: I86714a6b7364da20cd468cd784247009663a5140
2015-10-20 10:11:19 +01:00
Johann
ec623a0bb7 Upstream Mozilla fix for older Apple clang builds
Also use the _mm_broadcastsi128_si256 intrisic for
Apple clang versions 4.[012]

https://bugzilla.mozilla.org/show_bug.cgi?id=1085607
https://code.google.com/p/webm/issues/detail?id=1082

Change-Id: I6bc821d8163387194ef663e94bfed91fa7281d88
2015-10-14 07:41:23 -07:00
Alex Converse
0c00af126d Add vpx_highbd_convolve_{copy,avg}_sse2
single-threaded:
swanky (silvermont): ~1% faster overall
peppy (celeron,haswell): ~1.5% faster overall

Change-Id: Ib74f014374c63c9eaf2d38191cbd8e2edcc52073
2015-10-09 11:50:25 -07:00
Geza Lore
cbada4a982 Remove 4 mova insts from quantize_ssse3_x86_64.asm
Change-Id: If3cb9345b44162e600e6c74873e0cb4c207fc7fb
2015-10-09 07:52:04 -07:00
Julia Robson
37c68efee2 SSSE3 optimisation for quantize in high bit depth
When configured with high bit detpth enabled, the 8bit quantize
function stopped using optimised code. This made 8bit content
decode slowly. This commit re-enables the SSSE3 optimisations.

Change-Id: I194b505dd3f4c494e5c5e53e020f5d94534b16b5
2015-10-06 13:32:02 +01:00
Scott LaVarnway
b212094839 Merge "VPX: refactor vpx_idct32x32_1_add_sse2()" 2015-10-06 11:35:15 +00:00
Julia Robson
5e6533e707 SSE2 optimisation for quantize in high bit depth
When configured with high bit detpth enabled, the 8bit quantize
function stopped using optimised code. This made 8bit content
decode slowly. This commit re-enables the SSE2 optimisation
(but not the SSSE3 optimisation).

Change-Id: Id015fe3c1c44580a4bff3f4bd985170f2806a9d9
2015-10-05 10:59:16 -07:00
Scott LaVarnway
23d1c06268 VPX: refactor vpx_idct32x32_1_add_sse2()
Change-Id: Ia1a2cac0e9dc05f3207b3433a6c1589fa7f2aee3
2015-10-05 06:33:42 -07:00
Julia Robson
406030d1b0 Accelerated transform in high bit depth
When configured with high bitdepth enabled, the 8bit transform
stopped using optimised code. This made 8bit content decode slowly.

Change-Id: I67d91f9b212921d5320f949fc0a0d3f32f90c0ea
2015-09-28 21:09:16 -07:00
Johann
dd4f953350 Remove vpx_filter_block1d16_v8_intrin_ssse3
This was rewritten and moved to vpx_dsp/x86/vpx_subpixel_8t_ssse3.asm
in 195883023b

Change-Id: I117ce983dae12006e302679ba7f175573dd9e874
2015-09-18 16:05:43 -07:00
James Zern
683b5a3161 vpx_subpixel_8t_ssse3: fix reg counts/access
fixes build on windows x64; previously 'heightq' i.e., the 64-bit register
was accessed when only the 32-bit value was needed. given this is from a
stack variable the upper bits were undefined.

+ bump register/xmm counts; users of SETUP_LOCAL_VARS touch xmm13 in
64-bit builds and filter_block1d16_v* uses one extra temp variable

Change-Id: I9c768c0b2047481d1d3b11c2e16b2f8de6eb0d80
2015-09-17 12:27:34 -07:00
Scott LaVarnway
195883023b VPX: subpixel_8t_ssse3 asm using x86inc
This is based on the original patch optimized for 32bit
platforms by Tamar/Ilya and now uses the x86inc style asm.
The assembly was also modified to support 64bit platforms.

Change-Id: Ice12f249bbbc162a7427e3d23fbf0cbe4135aff2
2015-09-03 20:35:51 -07:00
Johann Koenig
5c245a46d8 Merge changes I53b5bdc5,Ib81168a7,Ie0113945
* changes:
  Only build ssse3 filter functions on 64 bit
  Clean up unused function warnings in vp8 encoder
  Clean up unused function warnings in vp8 onyx_if.c
2015-08-27 20:58:53 +00:00
Johann
a28b2c6ff0 Add sse2 versions of halfpix variance
These were lost in the great sub pixel variance move of
6a82f0d7fb

Not having these functions caused a ~10% performance regression in
some realtime vp8 encodes.

Change-Id: I50658483d9198391806b27899f2c0d309233c4b5
2015-08-27 11:58:38 -07:00
Johann
f5507b514c Only build ssse3 filter functions on 64 bit
Avoid an unused function warning by only building the functions when
they will be used.

Change-Id: I53b5bdc5a180c79d63b34e4c8921d679bbc54009
2015-08-26 10:32:18 -07:00
Scott LaVarnway
6c0f6dd817 Merge "VPX: scaled convolve : fix windows build errors" 2015-08-21 12:06:34 +00:00
Scott LaVarnway
acf24cc1b8 VPX: scaled convolve : fix windows build errors
Change-Id: Ic81d435ea928183197040cdf64b6afd7dbaf57e4
2015-08-20 13:09:27 -07:00
Scott LaVarnway
6a21ca20cc Merge "VPX ssse3 scaled convolve" 2015-08-19 22:12:21 +00:00
Jingning Han
49f6ff1103 Rename inv_txfm_sse2.asm to inv_wht_sse2.asm
Change-Id: I43bcc70680503e4c18d8f021097307778cf9ea70
2015-08-19 10:29:53 -07:00
Scott LaVarnway
2030c49cf8 VPX ssse3 scaled convolve
Change-Id: I71d5994e21813554a927d35ebcc26bf7a68984fd
2015-08-18 15:13:02 -07:00
Scott LaVarnway
6cf95bd1e7 Merge "VPX: remove step == 16 and filter[3] != 128 checks" 2015-08-12 20:13:33 +00:00
Scott LaVarnway
b04dad328c Merge "VPX: remove scaled calls from FUN_CONV_1D" 2015-08-11 21:46:50 +00:00
Scott LaVarnway
4ef08dcec8 Merge "VPX: Add rtcd support for scaling." 2015-08-11 13:19:00 +00:00
James Zern
9265bad906 Merge changes from topic 'x86inc'
* changes:
  Only use .text sections for aout
  Use newer x86inc.asm
  Use .text instead of .rodata on macho
  Copy PIC handling code from x86_abi_support
  Set 'private_extern' visibility for macho targets
  Avoid 'amdnop' when building with nasm
  Catch all elf formats
  Expand PIC default to macho64 and respect CONFIG_PIC from libvpx
  Use libvpx defines to set name mangling rules
  Customize x86inc.asm for libvpx
2015-08-10 21:20:38 +00:00
Scott LaVarnway
a229dbc1f0 VPX: remove step == 16 and filter[3] != 128 checks
from FUN_CONV_1D and FUN_CONV_2D macros.  The functions
will not be called with these inputs.

Change-Id: I67ec75e4edafc0acee70190521a80ea85dfa521b
2015-08-10 13:44:32 -07:00
Johann
41a0a0cb35 Use newer x86inc.asm
Rename updated version of x86inc.asm

Use "private_prefix" instead of "program_name" and make vpx the default
prefix.

Change-Id: I4883a99b2aee8e5dc9f2c16a2e6f4b5d6e4de458
2015-08-07 16:44:44 -07:00
Alex Converse
c65e79d2e5 ssim: Replace unsigned long with uint32_t.
The assembly only writes the low 4 bytes, and the HBD version only uses
uint32_t bytes.

Change-Id: Ie3694ecda511c231e55870df814cbae30e588073
2015-08-07 11:48:31 -07:00
Alex Converse
c7b7011b9b Move VP9 SSIM metrics to vpx_dsp.
Change-Id: I20c7b42631b579fade6cf7ebf6d4c69b2fcb5e5e
2015-08-06 18:25:25 -07:00
Aℓex Converse
7ac505c726 Merge "Narrow a load in iwht4x4_16_add." 2015-08-06 22:21:16 +00:00
Alex Converse
0572052725 Narrow a load in iwht4x4_16_add.
The top half is unused.

Change-Id: I29b2f6a93e20ea43aff4ad0bd2d52257e1e752b6
2015-08-05 12:16:12 -07:00
Scott LaVarnway
4e6b5079c6 VPX: remove scaled calls from FUN_CONV_1D
and FUN_CONV_2D macros.  The predict lut now handles
this case.  The encoder now calls vpx_scaled_2d() instead
of vpx_convolve8() for scaling.

Change-Id: Ia1c8af8a31e4cb4887a587143108cb45835f7df7
2015-08-05 10:47:06 -07:00
James Zern
afd2f68dae Revert "VP9_COPY_CONVOLVE_SSE2 optimization"
This reverts commit a5e97d874b.

Additionally:
Revert "vpx_convolve_copy_sse2: fix win64"

This reverts commit 22a8474fe7.

This change performs poorly on various x86_64 devices affecting
performance by 1-3% at 1080P. Performance on chromebook like devices was
mixed neutral to slightly negative, so there should be minimal change
there.

Change-Id: I95831233b4b84ee96369baa192a2d4cc7639658c
2015-08-04 17:57:01 -07:00
Jingning Han
d621de7e8d Change vp9_quantize to vpx_quantize
This commit clears all the vp9_ prefix use case in vpx_dsp. It gets
the vp9 folder ready to branch out vp10.

Change-Id: I2906eec179ee792b4af8c9b4161313653050e931
2015-08-04 15:31:49 -07:00
Jingning Han
08a453b9de Replace vp9_ prefix with vpx_ prefix in vpx_dsp function names
This commit clears the function naming convention in vpx_dsp. It
replaces vp9_ prefix of global functions with vpx_ prefix. It also
removes the vp9_ prefix from static functions.

Change-Id: I6394359a63b71a51dda01342eec6a3cc08dfeedf
2015-08-04 13:46:11 -07:00
Scott LaVarnway
8f6b943100 VPX: Add rtcd support for scaling.
Change-Id: If34bfb0d918967445aea7dc30cd7b55ebfedb1f2
2015-08-03 09:43:34 -07:00
Jingning Han
d10fc5af8f Merge "Add vpx_dsp_rtcd.h to inv_txfm_sse2.c" 2015-08-03 16:03:09 +00:00
Jingning Han
80ae856c8b Add vpx_dsp_rtcd.h to inv_txfm_sse2.c
Change-Id: Ibab434fb4bd6da02dba087582ed74811f555c3ed
2015-08-02 08:25:13 -07:00
James Zern
22a8474fe7 vpx_convolve_copy_sse2: fix win64
xmm6-7 need to be stored

Change-Id: I6c51559598d335946ec91be6246b49589c63b724
2015-08-01 11:45:49 -07:00
Jingning Han
b4c7d0523a Merge "Factor inverse transform functions into vpx_dsp" 2015-08-01 16:20:24 +00:00
Jingning Han
e8b133c79c Factor inverse transform functions into vpx_dsp
This commit moves the module inverse transform functions from vp9
to vpx_dsp folder. The hybrid transform wrapper functions stay in
the vp9 folder, since it involves codec-specific data structures.

Change-Id: Ib066367c953d3d024c73ba65157bbd70a95c9ef8
2015-07-31 16:21:00 -07:00
Scott LaVarnway
a5e97d874b VP9_COPY_CONVOLVE_SSE2 optimization
This function suffers from a couple problems in small core(tablets):
-The load of the next iteration is blocked by the store of previous iteration
-4k aliasing (between future store and older loads)
-current small core machine are in-order machine and because of it the store will spin the rehabQ until the load is finished
fixed by:
- prefetching 2 lines ahead
- unroll copy of 2 rows of block
- pre-load all xmm regiters before the loop, final stores after the loop
The function is optimized by:
copy_convolve_sse2 64x64 - 16%
copy_convolve_sse2 32x32 - 52%
copy_convolve_sse2 16x16 - 6%
copy_convolve_sse2 8x8 - 2.5%
copy_convolve_sse2 4x4 - 2.7%
credit goes to Tom Craver(tom.r.craver@intel.com) and Ilya Albrekht(ilya.albrekht@intel.com)

Change-Id: I63d3428799c50b2bf7b5677c8268bacb9fc29671
2015-07-31 14:51:51 -07:00
Zoe Liu
7186a2dd86 Code refactor on InterpKernel
It in essence refactors the code for both the interpolation
filtering and the convolution. This change includes the moving
of all the files as well as the changing of the code from vp9_
prefix to vpx_ prefix accordingly, for underneath architectures:
(1) x86;
(2) arm/neon; and
(3) mips/msa.
The work on mips/drsp2 will be done in a separate change list.

Change-Id: Ic3ce7fb7f81210db7628b373c73553db68793c46
2015-07-31 10:27:33 -07:00
Hui Su
4cbf36b105 Merge "Replace prefix vp9_ with vpx_ for intra prediction functions" 2015-07-29 00:38:48 +00:00
Jingning Han
d12a4a825c Merge "Replace vp9_ prefix in 2D-DCT functions with vpx_" 2015-07-29 00:07:31 +00:00
Jingning Han
fc18cf7a11 Merge "Move DC only forward 2D-DCT functions to vpx_dsp" 2015-07-29 00:06:37 +00:00
Jingning Han
4b5109cd73 Replace vp9_ prefix in 2D-DCT functions with vpx_
Clean up the forward 2D-DCT function names in vpx_dsp.

Change-Id: I3117978596d198b690036e7eb05fe429caf3bc25
2015-07-28 16:06:44 -07:00
Jingning Han
d19033fa4e Move DC only forward 2D-DCT functions to vpx_dsp
This completes the forward transform functions layout refactoring.

Change-Id: I996fb0fb795f41e2040f7b21db985774098aedbd
2015-07-28 14:52:30 -07:00
Johann
124ada514b Don't use 'h' for functions using x86inc.asm
In newer version of x86inc.asm 'h' is used as a modifier for register
names.

Change-Id: Ie5b9dd2f91ecdc8f6f18b2701b6dc23042b604e4
2015-07-28 14:00:32 -07:00
Hui Su
fe7cabe8b6 Merge "Move intra prediction functions from vp9/common/ to vpx_dsp/" 2015-07-28 20:41:01 +00:00
Jingning Han
a6a4659bea Factor 32x32 fwd DCT to vpx_dsp folder
Move the 32x32 2D-DCT implementations from vp9/ to vpx_dsp/.

Change-Id: Id3980696f8b69906ff7a59ff9fb2b9013d60047d
2015-07-28 11:13:41 -07:00
Jingning Han
8eefb36ca9 Move forward dct sse2 header file to vpx_dsp
Change-Id: Iba03852ce778c956200818e3473cfb2b48cf8d8e
2015-07-27 14:59:57 -07:00
hui su
4013645353 Replace prefix vp9_ with vpx_ for intra prediction functions
Change-Id: I8ae6fb586f8d5d018ace228df11714f82b085076
2015-07-27 13:42:06 -07:00
hui su
7971846a5e Move intra prediction functions from vp9/common/ to vpx_dsp/
Change-Id: I64edc26cf4aab050c83f2d393df6250628ad43b8
2015-07-27 13:38:16 -07:00
Jingning Han
5ebc8febdc Refactor vp9_idct.h file
Separate the common coefficient constant into vpx_dsp/txfm_common.h.
Move the SSE2 macro definitions to vpx_dsp/x86/txfm_common_sse2.h.
This clears the use case of vp9_idct.h in vpx_dsp folder.

Change-Id: I319735a2abf42888e5080ac14cfbcde34be7b121
2015-07-26 08:26:32 -07:00
Jingning Han
252ec59821 Remove vp9_dct.h from fwd_txfm_impl_sse2 header file
Change-Id: Ib3a4814fdb9d69cf6cc23bdd208f9bc9e7972edc
2015-07-24 21:11:44 +00:00
Jingning Han
5ddfa101c9 Add x86_64 flag to guard fwd_txfm_ssse3.asm in make file
This fixes a VS build error. Fix by @johannkoenig.

Change-Id: I6e71435d70ae56079db7328e4c7915416ece8fda
2015-07-23 12:55:50 -07:00
Jingning Han
b67821f37b Factor forward 2D-DCT transforms into vpx_dsp
This commit factors the 4x4, 8x8, and 16x16 2D-DCT forward
transform operations into vpx_dsp folder.

Change-Id: I084b117b79c0925edcbcabb93f62b9f4bf8dbe7d
2015-07-22 15:48:17 -07:00
Yunqing Wang
f65473c036 Merge "Migrate quantization functions from vp9/ to vpx_dsp/" 2015-07-20 16:20:07 +00:00
Yunqing Wang
38f1fbbb75 Migrate quantization functions from vp9/ to vpx_dsp/
The following quantization functions were moved:
vp9_quantize_b
vp9_quantize_b_32x32
vp9_highbd_quantize_b
vp9_highbd_quantize_b_32x32

vp9_quantize_dc
vp9_quantize_dc_32x32
vp9_highbd_quantize_dc
vp9_highbd_quantize_dc_32x32

The purpose of doing that was to allow these functions to be shared
by multiple codecs.

Change-Id: Id8ab939f283353cdd07bd930d47db3d932a5d87f
2015-07-17 16:38:14 -07:00
Jingning Han
2992739b5d Rename loop filter function from vp9_ to vpx_
Change-Id: I6f424bb8daec26bf8482b5d75dd9b0e45c11a665
2015-07-17 15:55:02 -07:00
Jingning Han
6e5ab70a93 Take out unnecessary header file from highbd_loopfilter_sse2
The dependency on vp9_loopfilter.h is not needed.

Change-Id: Ic0583c43d3d63f19cef06cf9d8e5c8031601be6a
2015-07-16 18:15:02 -07:00
Jingning Han
50adfdf5ba Migrate loop filter functions from vp9/ to vpx_dsp/
The various tap loop filter operations are common functions across
codec. This commit moves them along with SIMD optimizations to
vpx_dsp folder.

Change-Id: Ia5fa0b2e5289cdb98467502a549c380b9c60e92c
2015-07-16 16:40:47 -07:00
Yaowu Xu
c369daf3ea Clean out more MSVC warnings
Change-Id: I1bab0c104df2ec4825d050cd516e26ab635a7b3e
2015-07-08 15:09:20 -07:00
Johann
6a82f0d7fb Move sub pixel variance to vpx_dsp
Change-Id: I66bf6720c396c89aa2d1fd26d5d52bf5d5e3dff1
2015-07-07 15:51:04 -07:00
Jingning Han
432cd4bfb7 Move subtract functions from vp9 to vpx_dsp
Factor out the subtraction operator as common function.

Change-Id: I526e703477c6a290e0e3e3c8898f8bb1ca82779b
2015-07-06 12:22:47 -07:00
Johann
c3bdffb0a5 Move variance functions to vpx_dsp
subpel functions will be moved in another patch.

Change-Id: Idb2e049bad0b9b32ac42cc7731cd6903de2826ce
2015-05-26 12:01:52 -07:00
James Zern
4be50c5289 sad*_avx2.c: sync function signatures
+ include vpx_dsp_rtcd.h
silences missing prototype warnings

Change-Id: Ifa1780bcf72b1fa2b153025d0d78d91ad38774c3
2015-05-14 20:58:56 -07:00
Johann
d5d9289800 Move shared SAD code to vpx_dsp
Create a new component, vpx_dsp, for code that can be shared
between codecs. Move the SAD code into the component.

This reduces the size of vpxenc/dec by 36k on x86_64 builds.

Change-Id: I73f837ddaecac6b350bf757af0cfe19c4ab9327a
2015-05-06 16:58:20 -07:00