generic-library/vpx

Author	SHA1	Message	Date
Jingning Han	3cc8c8c3a0	Merge "Refactor intra predictor block"	2013-06-25 19:46:55 -07:00
Jingning Han	d19ea3861d	Refactor intra predictor block Remove vp9_intra4x4_predict(). Use the common intra prediction function for all block sizes. Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560	2013-06-25 16:33:13 -07:00
Ronald S. Bultje	c24d922396	Add averaging-SAD functions for 8-point comp-inter motion search. Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd	2013-06-25 12:57:28 -07:00
Yaowu Xu	b9c934df8e	Merge "Enable sse2 implmentation of 8x8 ADST/DCT"	2013-06-25 09:13:22 -07:00
Jingning Han	a32a086d23	Enable sse2 implmentation of 8x8 ADST/DCT This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f	2013-06-24 18:41:33 -07:00
John Koleszar	ece724ae16	Merge "Remove unused vp9_build_intra_predictors_sb{y,uv}_s"	2013-06-24 15:08:58 -07:00
John Koleszar	9e7019f7df	Remove unused vp9_build_intra_predictors_sb{y,uv}_s The functions no longer referenced. Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf	2013-06-21 16:10:05 -07:00
Ronald S. Bultje	54b2a59623	Implement SSE2 block_error. Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68	2013-06-21 12:54:52 -07:00
Ronald S. Bultje	25c588b1e4	Add subtract_block SSE2 version and unit test. 3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e	2013-06-21 09:35:37 -07:00
Ronald S. Bultje	1e6a32f1af	SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to 3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions which use a bilinear filter (x_offset & 7 \|\| y_offset & 7) aren't perfectly interleaved, and can probably be improved further in the future. I've marked this with a few TODOs/FIXMEs in the code. Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9	2013-06-20 15:59:48 -07:00
Ronald S. Bultje	8fb6c58191	Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 -> 3min58). Specific changes to timings for each function compared to original assembly-optimized versions (or just new version timings if no previous assembly-optimized version was available): sse2 4x4: 99 -> 82 cycles sse2 4x8: 128 cycles sse2 8x4: 121 cycles sse2 8x8: 149 -> 129 cycles sse2 8x16: 235 -> 245 cycles (?) sse2 16x8: 269 -> 203 cycles sse2 16x16: 441 -> 349 cycles sse2 16x32: 641 cycles sse2 32x16: 643 cycles sse2 32x32: 1733 -> 1154 cycles sse2 32x64: 2247 cycles sse2 64x32: 2323 cycles sse2 64x64: 6984 -> 4442 cycles ssse3 4x4: 100 cycles (?) ssse3 4x8: 103 cycles ssse3 8x4: 71 cycles ssse3 8x8: 147 cycles ssse3 8x16: 158 cycles ssse3 16x8: 188 -> 162 cycles ssse3 16x16: 316 -> 273 cycles ssse3 16x32: 535 cycles ssse3 32x16: 564 cycles ssse3 32x32: 973 cycles ssse3 32x64: 1930 cycles ssse3 64x32: 1922 cycles ssse3 64x64: 3760 cycles Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d	2013-06-20 09:34:25 -07:00
Jingning Han	7088426976	Merge "Make fdct32 computation flow within 16bit range"	2013-06-18 11:40:14 -07:00
Jingning Han	a41a4860c0	Make fdct32 computation flow within 16bit range This commit makes use of dual fdct32x32 versions for rate-distortion optimization loop and encoding process, respectively. The one for rd loop requires only 16 bits precision for intermediate steps. The original fdct32x32 that allows higher intermediate precision (18 bits) was retained for the encoding process only. This allows speed-up for fdct32x32 in the rd loop. No performance loss observed. Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3	2013-06-18 09:46:24 -07:00
Jingning Han	0b7910b9ff	Merge "Enable sse2 version of sad8x4/4x8"	2013-06-14 13:15:49 -07:00
Jingning Han	c43af9a8a3	Enable sse2 version of sad8x4/4x8 The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1	2013-06-14 09:19:28 -07:00
Jingning Han	15f50e7b42	Enable sse2 version of sad8x4/4x8 The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1	2013-06-13 16:18:18 -07:00
Scott LaVarnway	a81bd12a2e	Quick modifications to mb loopfilter intrinsic functions Modified to work with 8x8 blocks of memory. Will revisit later for further optimizations. For the HD clip used, the decoder improved by almost 20%. Change-Id: Iaa4785be293a32a42e8db07141bd699f504b8c67	2013-06-12 19:23:03 -04:00
Yaowu Xu	d682243012	Merge "Quick modifications to wide loopfilter intrinsic functions"	2013-06-12 15:16:11 -07:00
Ronald S. Bultje	fa96eeb835	Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d. Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56 to 4min42. Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f	2013-06-12 17:40:01 -04:00
Scott LaVarnway	26496c52bf	Quick modifications to wide loopfilter intrinsic functions Modified to work with 8x8 blocks of memory. Will revisit later for further optimizations. For the HD clip used, the decoder improved my 20%. Change-Id: Ia0057f55d66d1445882351ea6c43b595a5a980e5	2013-06-12 16:49:08 -04:00
John Koleszar	ceee4563d6	Remove unused vp9_idct_add_{y,uv}_block These functions are not used, and appear to have been superceded. Change-Id: I86fe51b088264f6b1b8d4d232bba97b371b98120	2013-06-12 12:24:22 -07:00
John Koleszar	0e1e16db90	Enable mmx loop filter routines The mmx routines work as expected for the loop filter, so enable them. Change-Id: I2bbd9b99a4445fcba17bb95002f1fb6e01fe8f85	2013-06-12 11:28:21 -07:00
John Koleszar	44db42c114	Merge the new loopfilter experiment Change-Id: I524ba98841f2e1850e3276ac365c501cea31546d	2013-06-10 12:30:12 -07:00
Ronald S. Bultje	073c7d5eec	Fix firstpass if framesize is not a multiple of 16. Change-Id: Iec41736c2b6140715f90f40de5ae6cf52497a9b8	2013-06-08 13:32:05 -07:00
John Koleszar	736c7b804a	Merge "Reimplementation of loop filter" into experimental	2013-06-06 17:34:26 -07:00
John Koleszar	043d348aae	Reimplementation of loop filter This version of the loop filter supports non-4:2:0 subsampling and a fourth plane, as well as changing the filtering order to be more friendly to hardware implementations. The filters are applied first to all vertical edges within the 64x64 SB, followed by the top horizontal edge and any internal horizontal edges. Since filtering is applied on each 4x4 edge serially, a dependency is created from filtering one block edge to the next. It would be possible to remove this depencnecy by building all filtering decisions from the unfiltered reconstruction data. Change-Id: I08f3e9683eb7bded8a76651cbc50fc0dfdd05fa7	2013-06-06 08:45:45 -07:00
Jim Bankoski	ced21bd6a6	Creates a new speed 1: This speed 1 - uses variance threshold stolen from static-thresh to determine split. Any superblock with greater than the variance set by static thresh * quantizer index squared is split. In addition transform size is set to largest size less than or equal to partition size, sub pixel filter is set to normal, and only 12 modes are used at all. Change-Id: If7a2858ee70f96d1eb989c04fd87a332b147abef	2013-05-30 19:53:00 -07:00
Jingning Han	7ac5ac52f9	Merge 4x4 block level partition into codebase Move 4x4/4x8/8x4 partition coding out of experimental list. This commit fixed the unit test failure issues. It also resolved the merge conflicts between 4x4 block level partition and iterative motion search for comp_inter_inter. Change-Id: I898671f0631f5ddc4f5cc68d4c62ead7de9c5a58	2013-05-23 11:58:50 +01:00
Yunqing Wang	f4fcfe3075	Optimize variance functions Added SSE2 version of variance functions for super blocks. Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d	2013-05-22 10:29:38 -07:00
Scott LaVarnway	0c3f3bf1d5	Removed vp9_recon functions No longer used. Change-Id: Ica5166f7117f4693dffdf7633dcfc1b263103d0d	2013-05-21 13:57:50 -04:00
Scott LaVarnway	ba48a11130	WIP: 4x4 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22	2013-05-20 13:03:17 -04:00
Scott LaVarnway	9aa37a51b2	Merge "WIP: 8x8 idct/recon merge" into experimental	2013-05-16 14:28:30 -07:00
Scott LaVarnway	794a7bedbd	WIP: 8x8 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45	2013-05-16 13:52:15 -04:00
Jingning Han	8e3d0e4d7d	Add building blocks for 4x8/8x4 rd search These building blocks enable rate-distortion optimization search over block sizes of 8x4 and 4x8. Need to convert them into mmx/sse forms. Change-Id: I570ea2d22d14ceec3fe3575128d7dfa172a577de	2013-05-16 10:41:29 -07:00
Dmitry Kovalev	cd16fe9160	Merge "Preparing vp9_deblock and vp9_denoise to alpha support." into experimental	2013-05-15 15:40:52 -07:00
Scott LaVarnway	a272ff25cd	WIP: 16x16 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1	2013-05-15 13:16:02 -04:00
Scott LaVarnway	2cf0d4be12	WIP: 32x32 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: I4ea09df0e162591e420d869b7431c2e7f89a8c1a	2013-05-14 15:54:17 -07:00
Dmitry Kovalev	7bbf716f04	Preparing vp9_deblock and vp9_denoise to alpha support. Change-Id: I299feefa64b93bd62263aea1ff1e41e85faeb6ca	2013-05-14 11:01:57 -07:00
John Koleszar	56efb73be3	Revert "Preparing vp9_deblock and vp9_denoise to alpha support." This reverts commit `a933311131` Change-Id: I2321f88011178381adbcffeda1bcc6a430ab8f1d	2013-05-14 06:46:11 -07:00
Dmitry Kovalev	a933311131	Preparing vp9_deblock and vp9_denoise to alpha support. Change-Id: Id1cc1c2663b9c2219cb830ffb4b0c6ab3468dc04	2013-05-13 14:03:29 -07:00
Dmitry Kovalev	4a559d3448	Merge "Removing unused simple loopfilter code." into experimental	2013-05-10 12:14:34 -07:00
Dmitry Kovalev	effaa3263d	Removing unused simple loopfilter code. Change-Id: Ic11dc052fb641687c015e1bbc37181b9babcd43e	2013-05-10 11:04:43 -07:00
Yunqing Wang	9f5811c2da	Add joint motion search in comp_inter_inter mode(experiment) In current code, motion vectors got from single prediction mode are used in compound prediction mode directly. These motion vectors may not give accurate prediction since they are searched independently. In this patch, we took Pascal's suggestion, and did joint motion search in compound prediction mode to find better motion vectors in this situation. Test results: Overall PSNR: 0.570%(derf), 0.918%(stdhd); SSIM: 0.572%(derf), 1.009%(stdhd); The encoder is a little slower. This can be improved since some c code is used in motion search. Change-Id: Ib30c9240f6c56c9b070867b4ca89412a76d9f3c6	2013-05-10 10:15:43 -07:00
Jingning Han	776c1482a3	Merge SB8X8 into the codebase Pull sb8x8 out of experimental list. verified via borg run tests. Fixed unit test failures. Change-Id: I12a4bbd17395930580c048ab68becad1ffe46e76	2013-05-07 09:08:25 -07:00
Ronald S. Bultje	f7fa367094	Fix first-pass intra4x4 for sb8x8 experiment. Change-Id: I1df17f45721c690d157800daa6a0b377e3d32bc2	2013-05-04 15:49:41 -07:00
Ronald S. Bultje	704fb4866e	Fix right-edge availability for intra prediction in sb8x8. Fixes valgrind uninitialized value use warnings. Change-Id: Ie9314d684e2ad194f8aca5bde1729fb9b7c0221d	2013-05-02 10:16:48 -07:00
Ronald S. Bultje	ff37688a91	Fix block reconstruction with sb8x8 enabled. The encoder reconstruction is now correct. Decoder to follow shortly. Change-Id: Iedf98cdaebb4ca1256c7714cad7024a75853ad6a	2013-05-01 19:28:17 -07:00
Ronald S. Bultje	b6c2d872f0	Fix some crashes in sb8x8 experiment. Change-Id: I390bb1cedc835f439fd5dd6cda6572b29cbb139c	2013-05-01 14:45:27 -07:00
John Koleszar	bb41ab4a0c	Remove BLOCKD structure All members can be referenced from their per-plane counterparts, and removes assumptions about 24 blocks per macroblock. Change-Id: I7ff2fa72d22c29163eb558981c8193765a8113d9	2013-04-26 10:35:54 -07:00
John Koleszar	4f55c5618a	Remove destination pointers from BLOCKD Access these members from MACROBLOCKD instead. Change-Id: I7907230dd473ff12ebe182b9280d8b7f12a888c4	2013-04-26 10:14:07 -07:00

1 2 3 4

161 Commits