Commit Graph

304 Commits

Author SHA1 Message Date
Parag Salasakar
ef51c1ab5b mips msa vp9 convolve8 hv optimization
average improvement ~5x-8x

Change-Id: I3214734cb3716e742907ce0d2d7a042d953df82b
2015-04-21 09:17:49 +05:30
Parag Salasakar
2e36149ccd Merge "mips msa vp9 convolve8 vert optimization" 2015-04-18 23:39:25 -07:00
Parag Salasakar
27d083c1b9 mips msa vp9 convolve8 vert optimization
average improvement ~6x-10x

Change-Id: Ie3f3ab3a9005be84935919701e56b404e420affa
2015-04-18 08:13:04 +05:30
Marco Paniconi
f76ccce5bc Revert "Revert "Force_split on 16x16 blocks in variance partition.""
This reverts commit 004b9d83e3

Change-Id: I2f2d0bdb9368c2c07f1d29a69cd461267a3a8743
2015-04-16 17:52:13 -07:00
Yunqing Wang
004b9d83e3 Revert "Force_split on 16x16 blocks in variance partition."
This reverts commit eb8c667570.
The patch caused mismatch while using multi-threads.

Change-Id: Icd646340af25b5d91e32f03ed3ea212e00e3e0be
2015-04-14 15:19:31 -07:00
Marco
eb8c667570 Force_split on 16x16 blocks in variance partition.
Force split on 16x16 block (to 8x8) based on the minmax over the 8x8 sub-blocks.

Also increase variance threshold for 32x32, and add exit condiiton in choose_partition
(with very safe threshold) based on sad used to select reference frame.

Some visual improvement near moving boundaries.
Average gain in psnr/ssim: ~0.6%, some clips go up ~1 or 2%.
Encoding time increase (due to more 8x8 blocks) from ~1-4%, depending on clip.

Change-Id: I4759bb181251ac41517cd45e326ce2997dadb577
2015-04-13 12:05:07 -07:00
Jingning Han
93d9c50419 Merge "SSSE3 assembly implementation of 8x8 Hadamard transform" 2015-04-09 11:16:11 -07:00
Jingning Han
7f629dfca4 SSSE3 assembly implementation of 8x8 Hadamard transform
It uses about 10% less CPU cycles than the SSE2 intrinsic
implementation.

Change-Id: I91017c0c068679a214b98cdd4cff3a6facfb7499
2015-04-04 09:59:37 -07:00
James Zern
44e3640923 Merge "vp9: enable sse4 sad functions" 2015-04-03 14:57:52 -07:00
James Zern
b644384bb5 Merge "vp9: fix high-bitdepth NEON build" 2015-04-01 23:36:17 -07:00
Jingning Han
1470529f62 Refactor block_yrd function for RTC coding mode
This commit separates Hadamard transform/quantization operations
from rate and distortion computation in block_yrd. This allows one
to skip SATD computation when all transform blocks are quantized
to zero. It also uses a new block error function that skips
repeated computation of sum of squared residuals. It reduces the
CPU cycles spent on block error calculation in block_yrd by 40%.

Change-Id: I726acb2454b44af1c3bd95385abecac209959b10
2015-04-01 12:00:43 -07:00
James Zern
14e24a1297 vp9: enable sse4 sad functions
sse4 isn't set by configure or used in rtcd, correct the sad entries to
use sse4_1 without changing the signatures for now.
this was done in vp8 post-vp9 branch.

Change-Id: Ia9f1fff9f2476fdfa53ed022778dd2f708caa271
2015-03-31 21:00:55 -07:00
James Zern
8845334097 vp9: fix high-bitdepth NEON build
remove incorrect specializations in rtcd and update a configuration
check in partial_idct_test.cc

Change-Id: I20f551f38ce502092b476fb16d3ca0969dba56f0
2015-03-31 17:45:25 -07:00
Jingning Han
26d3d3af6a Enable 16x16 Hadamard transform in SATD based mode decision
This commit replaces the 16x16 2D-DCT transform with Hadamard
transform for RTC coding mode. It reduces the CPU cycles cost
on 16x16 transform by 5X. Overall it makes the speed -6 encoding
speed 1.5% faster without compromise on compression performance.

Change-Id: If6c993831dc4c678d841edc804ff395ed37f2a1b
2015-03-30 15:43:31 -07:00
Jingning Han
8c411f74e0 Hadamard transform based coding mode decision process
This commit uses Hadamard transform based rate-distortion cost
estimate for rtc coding mode decision. It improves the compression
performance of speed -6 for many hard clips at lower bit-rates.
For example, 5.5% for jimredvga, 6.7% for mmmoving, 6.1% for
niklas720p. This will introduce extra encoding cycle costs at
this point.

Change-Id: Iaf70634fa2417a705ee29f2456175b981db3d375
2015-03-30 14:46:05 -07:00
Jingning Han
1790d45252 Use variance metric for integral projection vector match
This commit replaces the SAD with variance as metric for the
integral projection vector match. It improves the search accuracy
in the presence of slight light change. The average speed -6
compression performance for rtc set is improved by 1.7%. No speed
changes are observed for the test clips.

Change-Id: I71c1d27e42de2aa429fb3564e6549bba1c7d6d4d
2015-03-01 10:42:56 -08:00
Jingning Han
ed2dc59c1b Integral projection based motion estimation
This commit introduces a new block match motion estimation
using integral projection measurement. The 2-D block and the nearby
region is projected onto the horizontal and vertical 1-D vectors,
respectively. It then runs vector match, instead of block match,
over the two separate 1-D vectors to locate the motion compensated
reference block.

This process is run per 64x64 block to align the reference before
choosing partitioning in speed 6. The overall CPU cycle cost due
to this additional 64x64 block match (SSE2 version) takes around 2%
at low bit-rate rtc speed 6. When strong motion activities exist in
the video sequence, it substantially improves the partition
selection accuracy, thereby achieving better compression performance
and lower CPU cycles.

The experiments were tested in RTC speed -6 setting:
cloud 1080p 500 kbps
17006 b/f, 37.086 dB, 5386 ms ->
16669 b/f, 37.970 dB, 5085 ms (>0.9dB gain and 6% faster)

pedestrian_area 1080p 500 kbps
53537 b/f, 36.771 dB, 18706 ms ->
51897 b/f, 36.792 dB, 18585 ms (4% bit-rate savings)

blue_sky 1080p 500 kbps
70214 b/f, 33.600 dB, 13979 ms ->
53885 b/f, 33.645 dB, 10878 ms (30% bit-rate savings, 25% faster)

jimred 400 kbps
13380 b/f, 36.014 dB, 5723 ms ->
13377 b/f, 36.087 dB, 5831 ms  (2% bit-rate savings, 2% slower)

Change-Id: Iffdb6ea5b16b77016bfa3dd3904d284168ae649c
2015-02-19 13:47:19 -08:00
Frank Galligan
e3167f7fbf Add vp9_sad32x32x4d_neon Neon intrinsic function.
On Nexus 7 speed -6 saw ~18% increase in perf.

Tested on Nexus 7, built with ndk r10d, gcc 4.9.

BUG=https://code.google.com/p/webm/issues/detail?id=908

Change-Id: I70ccdea0326750552ed946fb004507d6efe02d5c
2015-01-27 08:54:00 -08:00
Frank Galligan
9f574d0316 Add vp9_sad16x16x4d_neon Neon intrinsic function.
On Nexus 7 speed -6 saw ~15% increase in perf.

Tested on Nexus 7, built with ndk r10d, gcc 4.9.

BUG=https://code.google.com/p/webm/issues/detail?id=908

Change-Id: I4b2006b644c488f42bf06d8a22ef0e6120a96bf9
2015-01-27 08:42:17 -08:00
Frank Galligan
54fa956715 Add vp9_sad64x64x4d_neon Neon intrinsic function.
On Nexus 7 speed -6 saw ~30% increase in perf.

Tested on Nexus 7, built with ndk r10d, gcc 4.9.

BUG=https://code.google.com/p/webm/issues/detail?id=908

Change-Id: Id12af7d1883243c23e6692e898aea82299633d58
2015-01-27 08:33:40 -08:00
Frank Galligan
9f6eba419a Add Neon intrinsic vp9_fdct8x8_quant_neon
On Nexus 7 speed -5 got ~2%, -6 got ~15%, -7 and -8 got ~30%
increase in perf.

Tested on Nexus 7, built with ndk r10d, gcc 4.9.

Change-Id: I83246d63b96674d170098a572fa4fe28a05aaf51
2015-01-24 22:49:50 -08:00
JackyChen
65f60f8e8c Merge "SSE2 code for the filter in MFQE." 2015-01-23 11:08:16 -08:00
JackyChen
09673deba9 SSE2 code for the filter in MFQE.
The SSE2 code is from VP8 MFQE, reuse it in VP9. No change on VP8
side. In our testing, we achieve 2X speed by adopting this change.

Change-Id: Ib2b14144ae57c892005c1c4b84e3379d02e56716
2015-01-18 16:07:59 -08:00
Frank Galligan
6e7e1cf32f Add Neon intrinsics for vp9_avg_8x8_neon
On Nexus 7 speed -5, -6, -7, and -8 saw about a 1% increase
in perf for 480p. Speeds -5, -6, -7, and -8 saw about a 1.5%
increase in perf for 720p.

Tested on Nexus 7, built with ndk r10d, gcc 4.9.

Change-Id: Ibf17ebfd952a6aec941719bd8306df8ec4574bee
2015-01-15 15:32:40 -08:00
Frank Galligan
ec1d8387e1 Add 64x64 sub_pel_variance Neon function
On Nexus 7 speed -5, -6, -7, and -8 saw about a 15% increase
in perf for 480p. Speeds -5, -6, -7, and -8 saw about a 10%
increase in perf for 720p.

Tested on Nexus 7, built with ndk r10d, gcc 4.9.

Change-Id: I2fa5315845e3021c9a6e2ea47e52e68b398d8334
2015-01-14 08:36:24 -08:00
Frank Galligan
74d40cd507 Add 64x variance Neon functions
Add optimized Neon functions of:
vp9_variance32x64
vp9_variance64x32
vp9_variance64x64

On Nexus 7 speed -5 and -6 saw about a 4% increase in perf.
Speeds -7 and -8 saw about a 6% increase in perf.
Tested on Nexus 7, built with ndk r10d, gcc 4.9.

Change-Id: I5a81f13c9897eb927fa39662530f5524a0f768fa
2015-01-13 15:08:13 -08:00
Johann
377b6682f9 Disable vp9 _8_ loopfilters
Investigating https://code.google.com/p/chromium/issues/detail?id=443839

Change-Id: Ibb7485d835c5aa5e1d40f31715596ba8d208eedb
2015-01-06 19:26:11 -08:00
Jingning Han
d0f2377027 Revert "Revert "Removal of legacy zbin_extra / zbin_oq_value.""
This reverts commit 9946ee23e0.

Fix the ssse3 asm function.

Change-Id: I07f77a63aa98087626e45c4e87aa5dcafc0b0b07
2014-12-22 10:09:25 -08:00
Paul Wilkins
9946ee23e0 Revert "Removal of legacy zbin_extra / zbin_oq_value."
This reverts commit e9b586e21b.

Change-Id: I5b36e6727da6c05278d97e2c37b80c109f79bed4
2014-12-19 15:02:58 +00:00
Paul Wilkins
e9b586e21b Removal of legacy zbin_extra / zbin_oq_value.
zbin extra / zbin_oq_value was widely passed around,
hence removal touches a lot of code.

Change-Id: Idc94359735b60c38a160e4385ae09d5ca8b6b8e5
2014-12-18 16:49:11 +00:00
James Yu
aeeaa67987 VP9 common for ARMv8 by using NEON intrinsics 15
Re-write
- vp9_lpf_horizontal_4_dual_neon
in vp9_loopfilter_16_neon.c

Change-Id: Ie14f63d352f9564ad01db3939a61d91cf6d21a31
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-16 20:00:26 -08:00
James Yu
aa8dd897c1 VP9 common for ARMv8 by using NEON intrinsics 16
Add vp9_reconintra_neon.c
- vp9_v_predictor_4x4_neon
- vp9_v_predictor_8x8_neon
- vp9_v_predictor_16x16_neon
- vp9_v_predictor_32x32_neon
- vp9_h_predictor_4x4_neon
- vp9_h_predictor_8x8_neon
- vp9_h_predictor_16x16_neon
- vp9_h_predictor_32x32_neon
- vp9_tm_predictor_4x4_neon
- vp9_tm_predictor_8x8_neon
- vp9_tm_predictor_16x16_neon
- vp9_tm_predictor_32x32_neon

Change-Id: Ib5d54a4766a1b5127169045659974f33aa98376d
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-16 12:57:52 -08:00
James Yu
4f856cd7fa VP9 common for ARMv8 by using NEON intrinsics 06
Add vp9_iht8x8_add_neon.c
- vp9_iht8x8_64_add_neon

The assembly did not previously implement tx_type 0
BUG=716

Change-Id: Icfc99dd24f3d59047f9184a7d0c761ba7e3de934
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-15 12:18:06 -08:00
James Yu
6b71013277 VP9 common for ARMv8 by using NEON intrinsics 05
Add vp9_iht4x4_add_neon.c
- vp9_iht4x4_16_add_neon

The assembly did not previously implement tx_type 0
BUG=715

Change-Id: I60034d1568de034edba45c5cdd13f3d87dbc73b6
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-15 12:16:19 -08:00
James Yu
3f7c12dab9 VP9 common for ARMv8 by using NEON intrinsics 18
Add vp9_idct32x32_add_neon.c
- vp9_idct32x32_1024_add_neon

Change-Id: Ic598b772c28bd3487a8ead7a4598a66b25f9b00f
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 18:20:04 -08:00
James Yu
3cfed4bf76 VP9 common for ARMv8 by using NEON intrinsics 14
Add vp9_idct16x16_add_neon.c
- vp9_idct16x16_256_add_neon_pass1
- vp9_idct16x16_256_add_neon_pass2
- vp9_idct16x16_10_add_neon_pass1
- vp9_idct16x16_10_add_neon_pass2

Change-Id: I54d25b54a36f4371760f54e4036693aaea40a5de
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 18:19:54 -08:00
James Yu
ce76aeb00d VP9 common for ARMv8 by using NEON intrinsics 13
Add vp9_idct8x8_add_neon.c
- vp9_idct8x8_64_add_neon
- vp9_idct8x8_10_add_neon

Change-Id: I6ee7b4496765aa36ed52990f2ef73e9f24459610
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 14:56:54 -08:00
James Yu
8c25f4af6a VP9 common for ARMv8 by using NEON intrinsics 12
Add vp9_idct4x4_add_neon.c
- vp9_idct4x4_16_add_neon

Change-Id: I011a96b10f1992dbd52246019ce05bae7ca8ea4f
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 14:49:59 -08:00
James Yu
420f58f2d2 VP9 common for ARMv8 by using NEON intrinsics 11
Add vp9_idct16x16_1_add_neon.c
- vp9_idct16x16_1_add_neon

Change-Id: I7c6524024ad4cb4e66aa38f1c887e733503c39df
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 13:06:58 -08:00
James Yu
030ca4d0e5 VP9 common for ARMv8 by using NEON intrinsics 10
Add vp9_idct32x32_1_add_neon.c
- vp9_idct32x32_1_add_neon

Change-Id: If9ffe9a857228f5c67f61dc2b428b40965816eda
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 13:04:29 -08:00
James Yu
2772b45ac0 VP9 common for ARMv8 by using NEON intrinsics 09
Add vp9_idct8x8_1_add_neon.c
- vp9_idct8x8_1_add_neon

Change-Id: I9d23e01fa96013febbf64db6c76c6c955f14e3ff
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 12:52:33 -08:00
James Yu
9114f0afdb VP9 common for ARMv8 by using NEON intrinsics 08
Add vp9_idct4x4_1_add_neon.c
- vp9_idct4x4_1_add_neon

Change-Id: Ieab9af107dbd07a4f9503bc945890c90faccb8ac
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-10 12:49:28 -08:00
James Yu
01fc6f51e0 VP9 common for ARMv8 by using NEON intrinsics 07
Add vp9_convolve8_neon.c
- vp9_convolve8_horiz_neon
- vp9_convolve8_vert_neon

Change-Id: I0bdd99ff72d275223fe211ac7243c25a5a60cf87
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-09 20:03:07 -08:00
James Yu
893534a996 VP9 common for ARMv8 by using NEON intrinsics 04
Add vp9_convolve8_avg_neon.c
- vp9_convolve8_avg_horiz_neon
- vp9_convolve8_avg_vert_neon

Change-Id: I617971e37b02186fec5aca181f4f9622050ea2df
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-09 20:03:07 -08:00
James Yu
d12757f5c6 VP9 common for ARMv8 by using NEON intrinsics 03
Add vp9_copy_neon.c
- vp9_convolve_copy_neon

Change-Id: I291fc5423d06240876411bbceab03eae5ef585be
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-09 20:02:46 -08:00
Scott LaVarnway
617382a2e3 VP9 common for ARMv8 by using NEON intrinsics 02
Add vp9_avg_neon.c
- vp9_convolve_avg_neon

Change-Id: Id2c9d5bcfa37cff1a16417aba1656ff07bdf10fd
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-09 19:00:21 -08:00
James Yu
5b098b1825 VP9 common for ARMv8 by using NEON intrinsics 01
Add vp9_loopfilter_neon.c
- vp9_lpf_horizontal_4_neon
- vp9_lpf_vertical_4_neon
- vp9_lpf_horizontal_8_neon
- vp9_lpf_vertical_8_neon

Change-Id: I97a0d7b399a431c21ee77396be3d5f5a1f7ebccb
Signed-off-by: James Yu <james.yu@linaro.org>
2014-12-09 12:26:56 -08:00
Peter de Rivaz
a306bd8274 Use the RTC optimizations when in high bitdepth mode.
Change 72193 made the encoder behave differently
when configured with and without high bitdepth.
This change means the same algorithm is used for both.

Change-Id: I707a44a94afca773a9e0c2f7ebeeea83030257c5
2014-12-04 15:48:42 -08:00
Marco
8fd3f9a2fb Enable non-rd mode coding on key frame, for speed 6.
For key frame at speed 6: enable the non-rd mode selection in speed setting
and use the (non-rd) variance_based partition.

Adjust some logic/thresholds in variance partition selection for key frame only (no change to delta frames),
mainly to bias to selecting smaller prediction blocks, and also set max tx size of 16x16.

Loss in key frame quality (~0.6-0.7dB) compared to rd coding,
but speeds up key frame encoding by at least 6x.
Average PNSR/SSIM metrics over RTC clips go down by ~1-2% for speed 6.

Change-Id: Ie4845e0127e876337b9c105aa37e93b286193405
2014-12-03 09:18:08 -08:00
Peter de Rivaz
7e40a55ef9 Added high bitdepth sse2 transform functions
Also removes some spurious changes in common/vp9_blockd.h which
was introduced by a rebase issue between nextgen and master branches.

Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
(cherry picked from commit 005d80cd05)
(cherry picked from commit 08d2f54800)
(cherry picked from commit 4230c2306c)
2014-12-02 11:16:24 -08:00
Debargha Mukherjee
02355a4abf Merge "Added highbitdepth sse2 acceleration for quantize" 2014-11-21 16:08:47 -08:00
Peter de Rivaz
a7b2d09f36 Added highbitdepth sse2 acceleration for quantize
Also includes block error.

(This patch is mostly cherry picked from
commit db7192e0b0)

Change-Id: Idef18f90b111a0d0c9546543d3347e551908fd78
2014-11-19 23:55:19 -08:00
Jingning Han
c42715b721 Enable ssse3 version of vp9_fdct8x8_quant
It improves the speed performance of vp9_fdct8x8_quant_sse2 by
about 5%.

Change-Id: I74b093ba4d81df64caf71ac7693f3d917f673097
2014-11-19 22:14:19 -08:00
Jingning Han
bf63652d34 Merge "Combine fdct8x8 and quantization process" 2014-11-19 11:17:44 -08:00
Jingning Han
ce77a7bcb0 Merge "Add sse2 version for vp9_quantize_fp" 2014-11-19 11:17:36 -08:00
Jingning Han
c6908fd5f7 Combine fdct8x8 and quantization process
This commit reworks the forward transform and quantization process
for 8x8 block coding. It combines the two operations in a single
function to save a store/load stage of the original transform
coefficients. Overall the speed -6 is slightly faster (around 1%
range). The compression performance of speed -6 is improved by
3.4%.

Change-Id: Id6628daef123f3e4649248735ec2ad7423629387
2014-11-18 18:10:56 -08:00
Jingning Han
2d3cc8ea2b Add sse2 version for vp9_quantize_fp
vp9_quantize_fp is the quantization process used by rtc coding
mode. This commit adds a sse2 implementation of it. The
implementation is modified based on vp9_quantize_b_sse2. No speed
difference from ssse3 version.

Change-Id: I24949c5b27df160b4f35117d28858d269454e64a
2014-11-18 09:01:41 -08:00
Yaowu Xu
1687c47bfd change to call vp9_refining_search_sad() directly
The function pointer in compressor instance does not change, so this
commit changes to call the function directly.

Change-Id: I9c9c460e3475711c384b74c9842f0b4f3d037cc5
2014-11-17 11:30:17 -08:00
Peter de Rivaz
48032bfcdb Added sse2 acceleration for highbitdepth variance
Change-Id: I446bdf3a405e4e9d2aa633d6281d66ea0cdfd79f
(cherry picked from commit d7422b2b1e)
(cherry picked from commit 6d741e4d76)
2014-11-14 15:18:53 -08:00
Peter de Rivaz
7eee487c00 Added highbitdepth sse2 SAD acceleration and tests
Change-Id: I1a74a1b032b198793ef9cc526327987f7799125f
(cherry picked from commit b1a6f6b9cb)
2014-11-12 14:25:45 -08:00
Yunqing Wang
687c56e802 Merge "SAD32xh and SAD64xh for AVX2" 2014-10-20 12:37:55 -07:00
levytamar82
7045aec00a SAD32xh and SAD64xh for AVX2
All sad function that process above 32 consecutive elements are optimized
for AVX2:
vp9_sad64x64
vp9_sad64x32
vp9_sad32x64
vp9_sad32x32
vp9_sad32x16
vp9_sad64x64_avg
vp9_sad64x32_avg
vp9_sad32x64_avg
vp9_sad32x32_avg
vp9_sad32x16_avg
The functions that appeared as a hotspot is vp9_sad32x32 and vp9_sad64x64
vp9_sad32x32 was optimized by 68% and vp9_sad64x64 was optimized by 90%
both of them gave and overall ~2.3% user level gain

Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd
2014-10-19 13:59:10 -07:00
Peter de Rivaz
73ae6e495c Add highbitdepth function for vp9_avg_8x8
Cherry-picked from https://gerrit.chromium.org/gerrit/#/c/71914/
(a92f987a6b) on highbitdepth branch.

Change-Id: I6903e4e4cb57d90590725c8a1c64c23da7ae65e8
2014-10-17 17:04:37 -07:00
Alex Converse
7497d2fb23 Add a 32-bit friendly sse2 quantizer.
This is based on the 64-bit ssse3 quantizer.

1.1x speedup for screen content at speed 7.

Change-Id: I57d15415ef97c49165954bbe3daaaf9318e37448
2014-10-14 11:37:41 -07:00
Deb Mukherjee
9a29fdbae7 Merge "Rename highbitdepth functions to use highbd prefix" 2014-10-09 15:39:56 -07:00
Deb Mukherjee
1929c9b391 Rename highbitdepth functions to use highbd prefix
Uses highbd_ prefix convention consistently.

Change-Id: I58f7f799a7ff8e32701bcd71c955bcf1cdd4581e
2014-10-09 14:40:40 -07:00
James Zern
caa0f81914 vp9_rtcd_defs: fix vp9_avg_8x8 declaration
vp9_avg_8x8 does not depend on x86inc, fixes 32-bit OS X build

Change-Id: I709b874ea84bf57c8cdb5ac7d43eecc6b8c1a2dd
2014-10-09 10:44:42 +02:00
Jim Bankoski
0ce51d823f experimental : partition using 1/8 x 1/8 image
The concept:

There's too much noise in source pixels for variance and at low bitrate
the reconstructed looks nothing like the source so we have problems
getting good partitionings with either.   This skirts the issue by using
a box blur scaled down version for variance calculations.  To compare
against source_var_ moved keyframe to be rd based like source_var.

Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624
2014-10-07 16:36:14 -07:00
JackyChen
a9f479682a Merge "Add SSE2 code and unit test for VP9 denoiser." 2014-10-07 10:51:55 -07:00
JackyChen
80465dae88 Add SSE2 code and unit test for VP9 denoiser.
This SSE2 is based on VP8 denoiser's SSE2 code. In VP8, there are
only 16x16 blocks in denoiser, while in VP9, there are 13 different
block sizes.

By adding this SSE2 code, the improvement of encoder speed is around
20%(using C code vs using SSE2 code), vary for different clips.

The unit test for VP9 denoiser is to confirm that the SSE2 code is
bit-exact with the C code. The unit test covers all block size.

Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d
2014-10-06 15:27:40 -07:00
Deb Mukherjee
d50716face Incorporate WRAPLOW macro into non-highbitdepth tx
Incorporates the WRAPLOW macro into the non-highbitdepth transforms
to aid hardware verification between a software C model and an
intended hardware implementation though the use of the configure
options: --enable-experimental --enable-emulate-hardware.
Note that to avoid further discrepancies between the sse/sse2
implementations of the transforms and the C implementation, when the
emulate hardware option is invoked, we also disable sse/sse2/etc.

Also incudes some minor cleanups/renaming etc.

Change-Id: Ib864d8493313927d429cce402982f1c8e45b3287
2014-10-03 11:38:05 -07:00
Deb Mukherjee
a160d72522 High-bitdepth bugfixes
Miscellaneous bug-fixes for high bitdepth functionality.
With this patch, high bit-depth profiles become mostly functional,
except for an intermittent assert failure issue that is being
tracked.

Change-Id: I6a7fcbdcf1e5b09842e88535f8442d2e1230748c
2014-10-01 14:18:11 -07:00
Deb Mukherjee
872b207b78 Moves transform type defines to vp9_common
Moves transform type defines to vp9_common.h from vp9_idct.h
so that they can be included in vp9_rtcd_defs.pl safely.

Change-Id: Id5106227bee5934f7ce8b06f2eb9fa8a9a2e0ddb
2014-09-30 19:44:17 -07:00
James Zern
4a296e6baa Revert "Fix compiling error in vp9_idct.h"
This reverts commit eafc8c9c40.

tran_low_t/tran_high_t don't belong in a public header, they're private.
Similarly the public headers shouldn't rely on config defines,
vpx_config.h isn't installed.

Change-Id: I194ec273598da418df8dd727b6c0e78a556740ad
2014-09-30 16:08:55 -07:00
Jingning Han
eafc8c9c40 Fix compiling error in vp9_idct.h
This commit fixes a compiling error in vp9_idct.h, where the codec
checks that the intermediate steps of transformation fit within
16-bit length. The issue was due to broken file dependency.

Change-Id: Ib22bba13a1e6df28489cb23d6774c561969f1fdc
2014-09-30 09:11:59 -07:00
Deb Mukherjee
931ed516ba High bit-depth loop/arf/postproc filter functions
Adds high-bitdepth loopfilter, temporal filter and postproc functions

Change-Id: I81c8a9176890784686bc4f2af0d550d243b3b2d3
2014-09-23 16:20:43 -07:00
Deb Mukherjee
0d3c3d3ce7 Adds high bitdepth convolve, interpred & scaling
Change-Id: Ie51c352a6b250547207cbc1ebba833a01ed053e3
2014-09-18 07:26:17 -07:00
Deb Mukherjee
81a8138fc3 Adding high-bitdepth intra prediction functions
Change-Id: I6f5cb101e2dc57c3d3f4d7e0ffb4ddbed027d111
2014-09-16 15:04:39 -07:00
Deb Mukherjee
5cd0aab81a Adds high bitdepth quantization functions
Adds various high bitdepth quantization functions.

Change-Id: I36fc0bf75a1bd15128ed271df8723de0ac134b0c
2014-09-16 14:55:37 -07:00
Yaowu Xu
601f3a886e Fix a performance regression
This commit adds back sse2 or ssse3 optimized versio of a couple of
functions, fixes a ~10% performance regression.

Change-Id: I049786906e5a641224dced63c6492aec9d86d183
2014-09-16 11:18:46 -07:00
Deb Mukherjee
10783d4f3a Adds high bitdepth transform functions and tests
Adds various high bitdepth transform functions and tests.
Much of the changes are related to using typedefs tran_low_t
and tran_high_t for the final transform cofficients and intermediate
stages of the transform computation respectively rather than fixed
types int16_t/int. When vp9_highbitdepth configure flag is off,
these map tp int16_t/int32_t, but when the flag is on, they map
to int32_t/int64_t to make space for needed extra precision.

Change-Id: I3c56de79e15b904d6f655b62ffae170729befdd8
2014-09-11 19:56:33 -07:00
Deb Mukherjee
1e4136d35d Adds high bit depth sad and variance functions
Moves high bit depth sad/var functions from highbitdepth
branch to master.

Change-Id: If03845d8ef9c9c494e13350e7a587c289306b94d
2014-09-11 17:30:44 -07:00
Johann
8645a53039 Allow specifying opt dependencies
If optimizations use more than one cpu feature, allow
specifying them so that '--disable-X' still works

https://code.google.com/p/webm/issues/detail?id=854

Change-Id: I3108ea37b397371a2be84dd5f2380b304db23f18
2014-09-11 13:43:48 -07:00
Dmitry Kovalev
980abf6078 Fixing Mac OS build.
Change-Id: Ifae8906185a868a07685eb7a7da2484af95e70a7
2014-09-08 08:53:12 -07:00
Dmitry Kovalev
89963bf586 Merge "Removing postproc mmx code." 2014-09-05 18:11:08 -07:00
Dmitry Kovalev
1100e262c5 Removing postproc mmx code.
Removed functions:
* vp9_post_proc_down_and_across_mmx
* vp9_mbpost_proc_down_mmx
* vp9_plane_add_noise_mmx

They all have sse2 equivalent.

Change-Id: I59c1fac12b7c96ca4538d455e4400c2b7875feff
2014-09-05 11:52:50 -07:00
James Zern
a8083449e9 fix x86-darwin* build
vp9_variance_sse2.c contains a mix of intrinsics and references to
assembly which uses x86inc.asm; it's conditionally included as a result.

Change-Id: I254451483a65881c0b8e18e27bf0c3ddef60c4ec
2014-09-04 23:32:13 -07:00
Dmitry Kovalev
490943552f Removing unused function prototypes.
Change-Id: Ia5e383e2cf18052f6f1eacf8b9495ab8e4d58878
2014-09-04 14:26:30 -07:00
Dmitry Kovalev
48197f0a70 Adding sse2 variant for vp9_mse{8x8, 8x16, 16x8}.
Change-Id: I6786d25ce4f32b8d8912f2d239a45ca15b310c4b
2014-09-03 19:02:14 -07:00
Dmitry Kovalev
318fc0c34f Removing MMX SAD calculation code.
Removed functions:
* vp9_sad_16x16_mmx
* vp9_sad_8x16_mmx
* vp9_sad_16x8_mmx
* vp9_sad_8x8_mmx
* vp9_sad_4x4_mmx

Change-Id: Ic5174b93b64d65d846f0c11e72cab149e9472bc3
2014-09-02 14:41:36 -07:00
Dmitry Kovalev
12cd6f421d Removing variance MMX code.
Removed functions:
* vp9_mse16x16_mmx
* vp9_get_mb_ss_mmx
* vp9_get4x4var_mmx
* vp9_get8x8var_mmx
* vp9_variance4x4_mmx
* vp9_variance8x8_mmx
* vp9_variance16x16_mmx
* vp9_variance16x8_mmx
* vp9_variance8x16_mmx

They all have SSE2 equivalent.

Change-Id: I3796f2477c4f59b35b4828f46a300c16e62a2615
2014-08-29 10:26:42 -07:00
Yaowu Xu
23c88870ec Merge "Fix bug 804" 2014-08-21 08:56:32 -07:00
levytamar82
69a5f5ecf7 Fix bug 807
in the sub_pixel_*variance* function the dst is aligned to 16 bytes and not
to 32 bytes - now load unaligned data

Change-Id: I2e0b9745543697efc56fefa32857ea10117af135
2014-08-07 18:51:02 -07:00
levytamar82
839911fb6d Fix bug 804
A bug in Microsoft compiler was found in the function
vp9_filter_block1d16_v8_avx2 and a workaround applied.
the bug occur when there was 4 consecutive maddubs + min + adds
intrinsic instructions.

Change-Id: I83499faeb70971e650e5663fd2490360ddb1a51b
2014-08-07 15:09:24 -07:00
levytamar82
af10457e02 Fix bug 806
in the function sad32x32x4d and sad64x64x4d the source is aligned to 16 bytes
and not to 32 bytes - the load is now unaligned.

Change-Id: I922fdba56d0936b5cf72e4503519f185645a168c
2014-08-07 14:13:30 -07:00
Frank Galligan
5f8fa13258 Merge "Added vp9_sad8x8_neon()" 2014-08-01 14:11:38 -07:00
Scott LaVarnway
98165ec074 Neon version of vp9_sub_pixel_variance8x8(),
vp9_variance8x8(), and vp9_get8x8var().

On a Nexus 7, vpxenc (in realtime mode, speed -12)
reported a performance improvement of ~1.2%.

Change-Id: I8a66ac2a0f550b407caa27816833bdc563395102
2014-08-01 11:35:55 -07:00
Frank Galligan
5487b6067c Merge "Neon version of vp9_sub_pixel_variance32x32()," 2014-08-01 09:46:37 -07:00
Scott LaVarnway
545be78136 Added vp9_sad8x8_neon()
Change-Id: I3be8911121ef9a5f39f6c1a2e28f9e00972e0624
2014-08-01 06:36:18 -07:00
Scott LaVarnway
6f4b8dcdc2 Neon version of vp9_subtract_block()
On a Nexus 7, vpxenc (in realtime mode, speed -12)
reported a performance improvement of ~3.2%

Change-Id: I8862497264142171b7efc32df1a67714a23539f4
2014-07-31 09:28:06 -07:00
Scott LaVarnway
d39448e2d4 Neon version of vp9_sub_pixel_variance32x32(),
vp9_variance32x32(), and vp9_get32x32var().

Change-Id: I8137e2540e50984744da59ae3a41e94f8af4a548
2014-07-31 08:00:36 -07:00
Scott LaVarnway
d4a37db5b8 Neon version of vp9_quantize_fp()
On a Nexus 7, vpxenc (in realtime mode, speed -12)
reported a performance improvement of ~12.4%

Change-Id: Id29d215acf58bb108489e218a259adf74b4768d7
2014-07-30 09:33:46 -07:00
Scott LaVarnway
521cf7e879 Neon version of vp9_sub_pixel_variance16x16(),
vp9_variance16x16(), and vp9_get16x16var().

On a Nexus 7, vpxenc (in realtime mode, speed -12)
reported a performance improvement of ~16.7%.

Change-Id: Ib163aa99f56e680194aabe00dacdd7f0899a4ecb
2014-07-30 08:17:32 -07:00
Scott LaVarnway
d19d222db6 Added vp9_fdct8x8_neon(), vp9_fdct8x8_1_neon()
On a Nexus 7, vpxenc (in realtime mode, speed -12)
reported a performance improvement of ~3.7%.

Change-Id: I428c72c40df82c6d537955e320a8debf99343004
2014-07-29 08:56:05 -07:00
levytamar82
4ba92dc5ab Fix bug 805
Remove all the redundant dct functions (dct4x4, dct8x8)
in avx2 except dct32x32 those functions were copied originally from dct_sse2

Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e
2014-07-28 15:46:01 -07:00
Scott LaVarnway
ba0652e83a Merge "Added vp9_sad64x64_neon(), vp9_sad32x32_neon()" 2014-07-17 11:42:16 -07:00
Scott LaVarnway
696fa52eaa Added vp9_sad64x64_neon(), vp9_sad32x32_neon()
and vp9_sad16x16_neon()

On a Nexus 7, vpxenc (in realtime mode, speed -6)
reported a performance improvement of ~17%.

Change-Id: I91e070cde2973451083d3f3d63b49b7886de9a85
2014-07-16 12:54:46 -07:00
Yunqing Wang
75cd57503d Refactor vp9_diamond_search_sad function
Currently, vp9_diamond_search_sadx4() is only called when sse3 is
enabled, which is improper since sse2 optimization of sdx4df
functions are available. Changed to always use
vp9_diamond_search_sadx4().

Change-Id: I4b95d6b7a3c6c645783c373f0ba8d645ece24717
2014-07-10 09:19:03 -07:00
Yunqing Wang
30117a576d Refactor refining_search_sad code
There are sse2 optimization of sdx4df functions. Instead of calling
vp9_refining_search_sadx4 only when sse3 is enabled, call it always.

Change-Id: I24f93818f7d4209d1425039e0eb099ff9ff08fe9
2014-07-09 16:50:11 -07:00
Jingning Han
9ad1b9fc67 Re-design quantization process for 32x32 transform block
This commit enables a new quantization process for 32x32 2D-DCT
transform coefficient blocks. It improves the compression
performance of speed 5 by 1.4%. The overall compression gains of
speed 5 due to the new quantization scheme is 4.7%. It also includes
the SSSE3 implementation of the 32x32 quantization process.

Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b
2014-07-08 16:55:28 -07:00
Jingning Han
9ac2f66320 Re-design quantization process
This commit re-designs the quantization process for transform
coefficient blocks of size 4x4 to 16x16. It improves compression
performance for speed 7 by 3.85%. The SSSE3 version for the
new quantization process is included.

The average runtime of the 8x8 block quantization is reduced
from 285 cycles -> 255 cycles, i.e., over 10% faster.

Change-Id: I61278aa02efc70599b962d3314671db5b0446a50
2014-07-01 17:00:07 -07:00
James Zern
88df435d6b Merge "vp9_rtcd: correct avx2 references" 2014-06-16 17:39:13 -07:00
Jingning Han
d5ae43318e Merge "Fast computation path for forward transform and quantization" 2014-06-12 11:59:52 -07:00
Jingning Han
ccba289f8d Fast computation path for forward transform and quantization
This commit enables a fast path computational flow for forward
transformation. It checks the sse and variance of prediction
residuals and decides if the quantized coefficients are all
zero, dc only, or more. It then selects the corresponding coding
path in the forward transformation and quantization stage.

It is currently enabled in rtc coding mode. Will do it for rd
coding mode next.

In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps
goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up.
Overall coding performance for rtc set is changed by -0.18%.

Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1
2014-06-12 11:10:54 -07:00
James Zern
9f3a0dbb5e vp9_rtcd: correct avx2 references
s/"\$avx2_x86inc"/"avx2"/

avx2 code is all intrinsics and as a result doesn't rely on x86inc.asm

Change-Id: I76ad39474d8a00658f3e43131830ef0f4f34772a
2014-06-10 16:26:36 -07:00
James Zern
520cb3f39f vp9_sub_pixel_*variance*: disable avx2 variants
tests failing under Win32/Win64

+ variance_test: add missing avx2 functions (partially disabled)

Change-Id: I6abc0657ea076379ab9ca65c12678b9ea199849d
2014-06-10 16:11:15 -07:00
James Zern
d3ff009d84 vp9_sad*x4d: disable avx2 variants
tests failing under Win32/Win64

+ sad_test: add missing avx2 functions (disabled)

Change-Id: I8224fba2b270f6039ab1877d71e1e512f0081856
2014-06-10 16:10:12 -07:00
James Zern
dd9f502933 vp9_f(dct|ht): disable avx2 variants
tests failing under Win32/Win64

+ dct16x16_test: add missing avx2 functions (partially disabled)

exercises the forward transforms
no idct/iht implementations, so the c-code is used

Change-Id: I04f64a457fa0828a00f32b5c9fe4f55294f21f61
2014-06-09 18:48:11 -07:00
James Zern
5704578f5f convolve: disable avx2 variants
tests failing under Win32/Win64

Change-Id: I5d49d11911bcda3a832b14efe5500d22597bedcf
2014-06-09 18:42:03 -07:00
Jingning Han
0c4a4225ec Merge "Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs" 2014-06-03 16:51:39 -07:00
Dmitry Kovalev
f7ff24cdd0 Reusing existing vp9_get{8x8, 16x16}var() instead of new ones.
Change-Id: I87b7c657d8813d7fb383ab519d150c0ffb1dd377
2014-05-29 11:14:06 -07:00
Jingning Han
6d21cbd20b Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs
This commit enables SSSE3 implementation of the inverse 2D-DCT
with only first 10 coefficients non-zero. It reduces the runtime
of SSE2 version from 745 cycles to 538 cycles, i.e., 27% speed-up.

Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe
2014-05-28 10:53:33 -07:00
Yunqing Wang
1f2200080b Revert "Making vp9_get_sse_sum_{8x8, 16x16} static."
This reverts commit e8bbb3d9db.

Change-Id: Ie368d36fd249d323d859d208609c711f04537bbc
2014-05-27 13:37:08 -07:00
Deb Mukherjee
444f93945b Merge "Remove Wextra warnings from vp9_sad.c" 2014-05-27 11:54:05 -07:00
Jingning Han
48b0891370 Inverse 16x16 2D-DCT SSSE3 implementation
This commit enables the SSSE3 implementation of full inverse 16x16
2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles,
about 7% speed-up.

Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
2014-05-23 15:09:35 -07:00
Deb Mukherjee
916550428d Remove Wextra warnings from vp9_sad.c
As a side-effect, the sad unit tests for VP8 and VP9
had to be separated.

Change-Id: I068cc2391eed51e9b140ea6aba78338c5fec8d71
2014-05-22 22:21:16 -07:00
Deb Mukherjee
a185bc3350 Extends temporal filtering to work for 422 data
This is needed for profiles 1 and 2.

Change-Id: I5dd7644c2932d055ab89e050d4be7d4117cd1028
2014-05-20 15:19:40 -07:00
Jim Bankoski
ec82d2dfec Merge "Revert "Remove Wextra warnings from vp9_sad.c"" 2014-05-15 11:54:23 -07:00
Yunqing Wang
c661cf0dad Merge "AVX2 To VP9 Block Error Optimization" 2014-05-15 11:29:29 -07:00
Jim Bankoski
a16794dd31 Revert "Remove Wextra warnings from vp9_sad.c"
This reverts commit 7ab9a9587b

Nightly test http://build.webmproject.org/jenkins/view/libvpx-nightly-tests/job/libvpx%20unit%20tests%20(valgrind-2)/arch=x86_64-linux-gcc,filter=-*VP8*:*Large.*/276/console

Failed 

This patch did not address all the assembly issues 
some of the vp8 assembly counts on 5 arguments being passed in to this function:   

one example : vp8_sad8x16_wmt

Please address or split this into vp9 and vp8 patches.

Change-Id: I78afcc171649894f887bb8ee3c66de24aaddc7ca
2014-05-15 08:31:20 -07:00
levytamar82
1fbab853c8 AVX2 To VP9 Block Error Optimization
vp9_block_error_sse2 can only handle 16 bytes at a time but
the function requires to handle a sequence of 32 bytes at a time
so each 16 bytes is handled in a different register.
With AVX2 optimization the 32 bytes can be handled in one register instead
of two in the SSE2
The vp9_block_error was optimized by 85%.
The user level was optimized by 1.2%

Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd
2014-05-14 11:51:07 -07:00
Deb Mukherjee
7ab9a9587b Remove Wextra warnings from vp9_sad.c
As a side-effect, the max_sad check is removed from the
C-implementation of VP8, for consistency with VP9, and to
ensure that the SAD tests common to VP8/VP9 pass.
That will make the VP8 C implementation of sad a little slower
but given that is rarely used in practice, the impact will be
minimal.

Change-Id: I7f43089fdea047fbf1862e40c21e4715c30f07ca
2014-05-14 03:17:31 -07:00
Johann
ce23931a3f Only build neon assembly for armv7 targets
Allow selectively building just the intrinsics for armv8

Change-Id: I2f29b2e4508b8b8e5649c2906b3159ad1d4ec477
2014-05-12 08:52:02 -07:00
Alex Converse
ec8a3272fa Merge "Add an x86inc MMX fwht4x4." 2014-05-09 13:48:49 -07:00
Jingning Han
9412785b02 Merge changes I3edd4b95,I4514f974,Ie7fa4386
* changes:
  Turn on unit tests for SSSE3 8x8 forward and inverse 2D-DCT
  Change eob threshold for partial inverse 8x8 2D-DCT to 12
  SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zero
2014-05-09 09:58:39 -07:00
Alex Converse
b5422fab46 Add an x86inc MMX fwht4x4.
Change-Id: Ib0a73d4863478f9b8a00976379d25d2f6ebbb197
2014-05-08 12:01:27 -07:00
Jingning Han
41a350a83d Change eob threshold for partial inverse 8x8 2D-DCT to 12
The scanning order has the first 12 coefficients of the 8x8 2D-DCT
sitting in the top left 4x4 block. Hence the partial inverse 8x8
2D-DCT allows to handle cases with eob below 12.

The overall runtime of the inverse 8x8 2D-DCT unit is reduced from
166 cycles (using SSE2) to 150 cycles (using SSSE3).

Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
2014-05-08 09:48:58 -07:00
Jingning Han
9e7b09bc5d SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zero
This commit enables ssse3 assembly implementation of the 8x8
inverse 2D-DCT with only first 10 coefficients non-zero. The
average runtime for this unit goes down from 198 cycles to 129
cycles (34.8% faster).

Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7
2014-05-07 17:40:02 -07:00
Paul Wilkins
33b1c457ed Revert "Add an MMX fwht4x4"
Includes changes that are not compatible with VS windows builds.
Amongst other things stdint.h is not supported in VS.

This reverts commit 89fbf3de50.

Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd
2014-05-07 12:53:27 +01:00
Alex Converse
75d05d5ed4 Merge "Add an MMX fwht4x4" 2014-05-06 11:12:27 -07:00
Jingning Han
d289deb04c Merge "SSSE3 implementation of full inverse 8x8 2D-DCT" 2014-05-06 09:17:22 -07:00
Dmitry Kovalev
e8bbb3d9db Making vp9_get_sse_sum_{8x8, 16x16} static.
Change-Id: Ifb7937c977308c682986f0ce9645a0807d2aa46a
2014-05-05 19:12:38 -07:00
Alex Converse
89fbf3de50 Add an MMX fwht4x4
7% faster encoding a desktop lossless at RT speed 4.

Change-Id: I41627f5b737752616b6512bb91a36ec45995bf64
2014-05-05 15:10:48 -07:00
Jingning Han
52ae97b6aa SSSE3 implementation of full inverse 8x8 2D-DCT
This commit enables SSSE3 version full inverse 8x8 2D-DCT and
reconstruction. It makes the runtime of vp9_idct8x8_64_add down
from 256 cycles (SSE2) to 246 cycles.

Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
2014-05-05 10:49:27 -07:00
Jingning Han
39761eb5d6 Merge "Enable SSSE3 implementation of 8x8 forward 2D-DCT" 2014-04-30 13:41:36 -07:00
Dmitry Kovalev
d2bc8816a1 Merge "Adding search_site_config struct." 2014-04-29 16:59:47 -07:00
Jingning Han
1eaa3a76dc Enable SSSE3 implementation of 8x8 forward 2D-DCT
Assembly implementation of ssse3 8x8 forward 2D-DCT. The current
version is turned on only for x86_64. The average unit runtime
goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster.
This translates into about 1.5% speed-up for pedestrian_area 1080p
at speed 2.

Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4
2014-04-29 15:49:18 -07:00
Dmitry Kovalev
aa464eca5e Adding search_site_config struct.
Change-Id: I2ad333553e673dbabcdc0f0366aea311e90849bf
2014-04-29 10:34:53 -07:00
Dmitry Kovalev
6e01079cc0 Removing unused vp9_variance_halfpixvar*() functions.
Change-Id: I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3
2014-04-25 11:50:07 -07:00
Dmitry Kovalev
03e7deae4f Removing unused vp9_sub_pixel_mse* functions.
Change-Id: I8d906da3bd6de0d3042676846f61a8b2a3444508
2014-04-24 11:49:12 -07:00
Dmitry Kovalev
63fa722179 Removing unused cost arguments from mcomp functions.
Change-Id: Id81a76d18be6b2de69f81bb563d74c3bb356d434
2014-04-11 10:24:36 -07:00
Yunqing Wang
4e66293fcb Use source frame difference to make partition decision
Calculate the difference variance between last source frame and
current source frame. The variance is calculated at 16x16 block
level. The variances are compared to several thresholds to decide
final partition sizes.

An adaptive strategy is implemented to decide using
SOURCE_VAR_BASED_PARTITION or FIXED_PARTITION based on motions
in the video. The switching test is done once every
search_type_check_frequency frames.

The selection of source_var_thresh needs to be investigated
further later.

RTC set Borg test showed 0.424% overall psnr gain, and 0.357%
ssim gain. For clips with large enough static area, the
encoding speedup is around 2% to 15%.

Change-Id: Id7d268f1d8cbca7fb8026aa4a53b3c77459dc156
2014-04-08 17:03:02 -07:00
levytamar82
0fa8b668c1 AVX2 SAD Optimization:
2 functions were optimized for avx2 by using full 256 bit register
In order to handle 32 elements in parallel instead of only 16 in parallel:
1. vp9_sad32x32x4d
2. vp9_sad64x64x4d

The function level gain is 66% and the user level gain is ~1%.

Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
2014-03-21 13:53:32 -07:00
James Zern
805078a1bf build: convert rtcd.sh to perl
significantly speeds up file generation.

the goal of this change is to convert rtcd.sh to perl as directly as
possible to allow for simple comparison. future changes can make it more
perl-like.

---
Linux
    [CREATE] vpx_scale_rtcd.h
real    0m0.485s ->    0m0.022s
    [CREATE] vp8_rtcd.h
real    0m4.619s ->    0m0.060s
    [CREATE] vp9_rtcd.h
real    0m10.102s ->    0m0.087s

Windows
    [CREATE] vpx_scale_rtcd.h
real    0m8.360s ->    0m0.080s
    [CREATE] vp8_rtcd.h
real    1m8.083s ->    0m0.160s
    [CREATE] vp9_rtcd.h
real    2m6.489s ->    0m0.233s

Change-Id: Idfb71188206c91237d6a3c3a81dfe00d103f11ee
2014-03-03 14:47:11 -08:00