generic-library/vpx

Author	SHA1	Message	Date
Mans Rullgard	d85ae87183	vp9: neon: add vp9_mb_lpf_* functions Change-Id: I13e0880df234f15abc4cc7c57fe84488d5d46a75	2013-08-02 08:10:50 -07:00
Jingning Han	67719abde1	Remove unused vp9_short_idct10_32x32_add The inverse 32x32 transform detects all zero entries and skips the computations accordingly per 8 rows in the first 1-D operation. The function vp9_short_idct10_32x32_add performs differently and is not used anywhere, hence removed. Change-Id: Ic4fad422debbde7b6b6ffed47c69fbd4268a906c	2013-08-01 12:45:16 -07:00
Jingning Han	a7c4de22e1	16x16 inverse 2D-DCT with DC only This commit provides special handle on 16x16 inverse 2D-DCT, where only DC coefficient is quantized to be non-zero value. Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c	2013-07-29 14:45:53 -07:00
Ronald S. Bultje	6f3054b65d	Merge "d45 intra prediction SSSE3 optimizations."	2013-07-26 17:21:09 -07:00
Jingning Han	325e0aa650	Special handle on DC only inverse 8x8 2D-DCT This commit enables a special handle for the 8x8 inverse 2D-DCT, where only DC coefficient is quantized to be non-zero. For bus_cif at 2000 kbps, it provides about 1% speed-up at speed 0. Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011	2013-07-26 14:16:51 -07:00
Ronald S. Bultje	94b0c6791d	d45 intra prediction SSSE3 optimizations. Change-Id: Ie48035ff4f93c41f8a9b3023e6444fd10432d8fb	2013-07-26 13:30:02 -07:00
Jingning Han	384e37e32b	SSE2 inverse 4x4 2D-DCT with DC only Add SSE2 implementation to handle the special case of inverse 2D-DCT where only DC coefficient is non-zero. Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f	2013-07-24 23:19:56 -07:00
Jingning Han	d2de1ca37b	Merge vp9_dc_only_idct_add and vp9_short_idct4x4_1 They share the same functionality, so merging together. Change-Id: I98a0386fcee052cb854f9ff90c283c1b844bcb79	2013-07-24 16:51:15 -07:00
hkuang	d757de744c	Add neon optimize vp9_short_idct8x8_add. Change-Id: Ic32acf3e2939c6d12d9c2bf192a5f5da59705fda	2013-07-18 16:40:41 -07:00
Johann	9ca66ec050	Merge "vp9_convolve8_neon placeholder"	2013-07-17 10:09:00 -07:00
Johann	59dc4e9cdd	vp9_convolve8_neon placeholder Call the individually optimized horizontal and vertical functions. This implementation abuses the temp buffer. This will be replaced with a custom optimized function. Over 2x speedup. Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd	2013-07-17 08:39:27 -07:00
Jingning Han	d05f66aa10	SSE2 16x16 inverse ADST/DCT hybrid transform This commit enables SSE2 implementation of 16x16 inverse ADST/DCT hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles. This provides about 1% encoding speed-up at speed 0. Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b	2013-07-16 12:51:42 -07:00
Jingning Han	5851904744	Merge "SSE2 8x8 inverse ADST/DCT transform"	2013-07-16 11:00:11 -07:00
Jingning Han	91365addf8	SSE2 8x8 inverse ADST/DCT transform This commit enables SSE2 implementation of 8x8 inverse ADST/DCT transform. The runtime goes from 1216 cycles -> 266 cycles. For bus_cif at 2000 kbps, the overall runtime reduces from 253707ms -> 248430ms, i.e., 2% speed-up at speed 0. Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb	2013-07-12 21:03:16 -07:00
Johann	a15bebfc0a	vp9_convolve8_[horiz|vert]_avg Super basic conversion from the other implementations. Any changes to one should be trivial to copy over to keep in sync. Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8	2013-07-12 16:21:33 -07:00
Jingning Han	dac5891a1a	Merge "SSE2 4x4 inverse ADST/DCT transform"	2013-07-11 14:17:23 -07:00
Johann	158c80cbb0	convolve8 optimizations for neon Independent horizontal and vertical implementations. Requires that blocks be built from 4x4 and [xy]_step_q4 == 16. 6-10% improvement. CIF improved the least. Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda	2013-07-11 11:08:19 -07:00
hkuang	c9b25dcae4	Add neon optimize vp9_dc_only_idct_add. Change-Id: Iae84ab945cc9662a0ddd839aa2b9ca59f2ae5423	2013-07-11 10:30:47 -07:00
Jim Bankoski	5000cdf0ff	Merge "Wide loopfilter 16 pix at a time"	2013-07-11 06:44:02 -07:00
Jingning Han	49b6302044	SSE2 4x4 inverse ADST/DCT transform Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from 292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the overall runtime of speed 0 goes from 301s to 295s (2% speed-up). Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628	2013-07-10 20:16:02 -07:00
Ronald S. Bultje	decead7336	Replace copy_memNxM functions with a generic copy/avg function. Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa	2013-07-10 18:27:24 -07:00
John Koleszar	64f7a4d8cb	Wide loopfilter 16 pix at a time Where possible, do the 16 pixel wide filter while doing the horizontal filtering pass. The same approach can be taken for the mbloop_filter when that's implemented. Doing so on the vertical pass is a little more involved, but possible. Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320	2013-07-10 16:32:44 -07:00
Ronald S. Bultje	e6f955251f	Merge "SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction."	2013-07-10 14:52:23 -07:00
Ronald S. Bultje	6a60249071	Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction."	2013-07-10 14:52:19 -07:00
Jingning Han	114423538f	SSE2 16x16 ADST/DCT hybrid transform This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230	2013-07-10 12:14:53 -07:00
Ronald S. Bultje	7fd643264a	SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction. Change-Id: Iad70966b986f65259329070e258f76ef0af816b4	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	8dade638a1	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction. Change-Id: I3441c059214c2956e8261331bbf521525a617a86	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	75b33c68c7	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction. Change-Id: I55a6cfa2daba738cbc0c4a02f806893f7e556997	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	92c5d3665d	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction. Change-Id: Ibe1690afc5459f3b3beca401e7734fcd03da6dd0	2013-07-10 09:28:03 -07:00
Frank Galligan	198fa6d0a0	Add Neon horizontal and vertical vp9_mbloop_filter - The vp9 mbfilter C code will branch on flat and mask. This CL will perform both branches and combine the data. A later CL will perform a check to see if all patch will take one branch. - These functions are about 1.75 times faster than the C code on Nexus 7. PS #3 - Changed all functions to dup limit, blimit, and thresh from vld {dx[]}, freeing up r4-r6. - Changed code to use vbif to reduce one instruction and free up a d register. Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777	2013-07-09 12:40:05 -07:00
Ronald S. Bultje	8350e7fe38	Make intra prediction pointers RTCD-based. This probably has a mildly negative impact on performance, but will (in future commits - or possibly merged with this one) allow SIMD implementations of individual intra prediction functions. We may perhaps want to consider having separate functions per txfm-size also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for each intra prediction mode), but I haven't played much with that yet. Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269	2013-07-08 17:25:51 -07:00
Ronald S. Bultje	c8defcfdee	Update quantize SSSE3 SIMD to cover 32x32 transform case also. Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87	2013-07-01 11:36:33 -07:00
Ronald S. Bultje	7353ceab9d	Quantize (64-bit only, for now) SSSE3 SIMD. Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904	2013-07-01 11:36:07 -07:00
Jingning Han	993942ce0c	Merge "Enable SSE2 4x4 ADST/DCT transform"	2013-06-29 15:57:04 -07:00
Christian Duvivier	466e0cf303	SSE2 version of vp9_short_fdct32x32_rd. 43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0	2013-06-29 13:53:00 -07:00
chm	a83cfd4da1	add Neon optimized add constant residual functions - Add add_constant_residual_8x8 16x16 32x32 functions - Tested under RealView debugger environment Change-Id: I5c3a432f651b49bf375de6496353706a33e3e68e	2013-06-28 19:06:51 -07:00
Jingning Han	1109b6b888	Enable SSE2 4x4 ADST/DCT transform This commit enables SSE2 4x4 forward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660	2013-06-28 17:24:43 -07:00
Ronald S. Bultje	af660715c0	Make coefficient skip condition an explicit RD choice. This commit replaces zrun_zbin_boost, a method of biasing non-zero coefficients following runs of zero-coefficients to be rounded towards zero, with an explicit skip-block choice in the RD loop. The logic is basically that if individual coefficients should be rounded towards zero (from a RD point of view), the trellis/optimize loop should take care of it. If whole blocks should be zero (from a RD point of view), a single RD check is much more efficient than a complete serialization of the quantization loop. Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim. SIMD for quantize will follow in a separate patch. Results for other test sets pending. Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4	2013-06-28 10:28:49 -07:00
Frank Galligan	1d6dc1b702	Add Neon optimized loop filter functions. - Added vp9_loop_filter_horizontal_edge_neon and vp9_loop_filter_vertical_edge_neon. - The functions are based off the vp8 loopfilter functions. - Matches x86 md5 checksum. Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0	2013-06-27 16:14:45 -07:00
Jingning Han	3cc8c8c3a0	Merge "Refactor intra predictor block"	2013-06-25 19:46:55 -07:00
Jingning Han	d19ea3861d	Refactor intra predictor block Remove vp9_intra4x4_predict(). Use the common intra prediction function for all block sizes. Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560	2013-06-25 16:33:13 -07:00
Ronald S. Bultje	c24d922396	Add averaging-SAD functions for 8-point comp-inter motion search. Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd	2013-06-25 12:57:28 -07:00
Yaowu Xu	b9c934df8e	Merge "Enable sse2 implementation of 8x8 ADST/DCT"	2013-06-25 09:13:22 -07:00
Jingning Han	a32a086d23	Enable sse2 implementation of 8x8 ADST/DCT This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f	2013-06-24 18:41:33 -07:00
John Koleszar	ece724ae16	Merge "Remove unused vp9_build_intra_predictors_sb{y,uv}_s"	2013-06-24 15:08:58 -07:00
John Koleszar	9e7019f7df	Remove unused vp9_build_intra_predictors_sb{y,uv}_s The functions are no longer referenced. Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf	2013-06-21 16:10:05 -07:00
Ronald S. Bultje	54b2a59623	Implement SSE2 block_error. Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68	2013-06-21 12:54:52 -07:00
Ronald S. Bultje	25c588b1e4	Add subtract_block SSE2 version and unit test. 3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e	2013-06-21 09:35:37 -07:00
Ronald S. Bultje	1e6a32f1af	SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to 3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't perfectly interleaved, and can probably be improved further in the future. I've marked this with a few TODOs/FIXMEs in the code. Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9	2013-06-20 15:59:48 -07:00
Ronald S. Bultje	8fb6c58191	Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 -> 3min58). Specific changes to timings for each function compared to original assembly-optimized versions (or just new version timings if no previous assembly-optimized version was available): sse2 4x4: 99 -> 82 cycles sse2 4x8: 128 cycles sse2 8x4: 121 cycles sse2 8x8: 149 -> 129 cycles sse2 8x16: 235 -> 245 cycles (?) sse2 16x8: 269 -> 203 cycles sse2 16x16: 441 -> 349 cycles sse2 16x32: 641 cycles sse2 32x16: 643 cycles sse2 32x32: 1733 -> 1154 cycles sse2 32x64: 2247 cycles sse2 64x32: 2323 cycles sse2 64x64: 6984 -> 4442 cycles ssse3 4x4: 100 cycles (?) ssse3 4x8: 103 cycles ssse3 8x4: 71 cycles ssse3 8x8: 147 cycles ssse3 8x16: 158 cycles ssse3 16x8: 188 -> 162 cycles ssse3 16x16: 316 -> 273 cycles ssse3 16x32: 535 cycles ssse3 32x16: 564 cycles ssse3 32x32: 973 cycles ssse3 32x64: 1930 cycles ssse3 64x32: 1922 cycles ssse3 64x64: 3760 cycles Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d	2013-06-20 09:34:25 -07:00
Jingning Han	7088426976	Merge "Make fdct32 computation flow within 16bit range"	2013-06-18 11:40:14 -07:00
Jingning Han	a41a4860c0	Make fdct32 computation flow within 16bit range This commit makes use of dual fdct32x32 versions for rate-distortion optimization loop and encoding process, respectively. The one for rd loop requires only 16 bits precision for intermediate steps. The original fdct32x32 that allows higher intermediate precision (18 bits) was retained for the encoding process only. This allows speed-up for fdct32x32 in the rd loop. No performance loss observed. Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3	2013-06-18 09:46:24 -07:00
Jingning Han	0b7910b9ff	Merge "Enable sse2 version of sad8x4/4x8"	2013-06-14 13:15:49 -07:00
Jingning Han	c43af9a8a3	Enable sse2 version of sad8x4/4x8 The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1	2013-06-14 09:19:28 -07:00
Jingning Han	15f50e7b42	Enable sse2 version of sad8x4/4x8 The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1	2013-06-13 16:18:18 -07:00
Scott LaVarnway	a81bd12a2e	Quick modifications to mb loopfilter intrinsic functions Modified to work with 8x8 blocks of memory. Will revisit later for further optimizations. For the HD clip used, the decoder improved by almost 20%. Change-Id: Iaa4785be293a32a42e8db07141bd699f504b8c67	2013-06-12 19:23:03 -04:00
Yaowu Xu	d682243012	Merge "Quick modifications to wide loopfilter intrinsic functions"	2013-06-12 15:16:11 -07:00
Ronald S. Bultje	fa96eeb835	Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d. Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56 to 4min42. Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f	2013-06-12 17:40:01 -04:00
Scott LaVarnway	26496c52bf	Quick modifications to wide loopfilter intrinsic functions Modified to work with 8x8 blocks of memory. Will revisit later for further optimizations. For the HD clip used, the decoder improved by 20%. Change-Id: Ia0057f55d66d1445882351ea6c43b595a5a980e5	2013-06-12 16:49:08 -04:00
John Koleszar	ceee4563d6	Remove unused vp9_idct_add_{y,uv}_block These functions are not used, and appear to have been superseded. Change-Id: I86fe51b088264f6b1b8d4d232bba97b371b98120	2013-06-12 12:24:22 -07:00
John Koleszar	0e1e16db90	Enable mmx loop filter routines The mmx routines work as expected for the loop filter, so enable them. Change-Id: I2bbd9b99a4445fcba17bb95002f1fb6e01fe8f85	2013-06-12 11:28:21 -07:00
John Koleszar	44db42c114	Merge the new loopfilter experiment Change-Id: I524ba98841f2e1850e3276ac365c501cea31546d	2013-06-10 12:30:12 -07:00
Ronald S. Bultje	073c7d5eec	Fix firstpass if framesize is not a multiple of 16. Change-Id: Iec41736c2b6140715f90f40de5ae6cf52497a9b8	2013-06-08 13:32:05 -07:00
John Koleszar	736c7b804a	Merge "Reimplementation of loop filter" into experimental	2013-06-06 17:34:26 -07:00
John Koleszar	043d348aae	Reimplementation of loop filter This version of the loop filter supports non-4:2:0 subsampling and a fourth plane, as well as changing the filtering order to be more friendly to hardware implementations. The filters are applied first to all vertical edges within the 64x64 SB, followed by the top horizontal edge and any internal horizontal edges. Since filtering is applied on each 4x4 edge serially, a dependency is created from filtering one block edge to the next. It would be possible to remove this dependency by building all filtering decisions from the unfiltered reconstruction data. Change-Id: I08f3e9683eb7bded8a76651cbc50fc0dfdd05fa7	2013-06-06 08:45:45 -07:00
Jim Bankoski	ced21bd6a6	Creates a new speed 1: This speed 1 - uses variance threshold stolen from static-thresh to determine split. Any superblock with greater than the variance set by static thresh * quantizer index squared is split. In addition transform size is set to largest size less than or equal to partition size, sub pixel filter is set to normal, and only 12 modes are used at all. Change-Id: If7a2858ee70f96d1eb989c04fd87a332b147abef	2013-05-30 19:53:00 -07:00
Jingning Han	7ac5ac52f9	Merge 4x4 block level partition into codebase Move 4x4/4x8/8x4 partition coding out of experimental list. This commit fixed the unit test failure issues. It also resolved the merge conflicts between 4x4 block level partition and iterative motion search for comp_inter_inter. Change-Id: I898671f0631f5ddc4f5cc68d4c62ead7de9c5a58	2013-05-23 11:58:50 +01:00
Yunqing Wang	f4fcfe3075	Optimize variance functions Added SSE2 version of variance functions for super blocks. Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d	2013-05-22 10:29:38 -07:00
Scott LaVarnway	0c3f3bf1d5	Removed vp9_recon functions No longer used. Change-Id: Ica5166f7117f4693dffdf7633dcfc1b263103d0d	2013-05-21 13:57:50 -04:00
Scott LaVarnway	ba48a11130	WIP: 4x4 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22	2013-05-20 13:03:17 -04:00
Scott LaVarnway	9aa37a51b2	Merge "WIP: 8x8 idct/recon merge" into experimental	2013-05-16 14:28:30 -07:00
Scott LaVarnway	794a7bedbd	WIP: 8x8 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45	2013-05-16 13:52:15 -04:00
Jingning Han	8e3d0e4d7d	Add building blocks for 4x8/8x4 rd search These building blocks enable rate-distortion optimization search over block sizes of 8x4 and 4x8. Need to convert them into mmx/sse forms. Change-Id: I570ea2d22d14ceec3fe3575128d7dfa172a577de	2013-05-16 10:41:29 -07:00
Dmitry Kovalev	cd16fe9160	Merge "Preparing vp9_deblock and vp9_denoise to alpha support." into experimental	2013-05-15 15:40:52 -07:00
Scott LaVarnway	a272ff25cd	WIP: 16x16 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1	2013-05-15 13:16:02 -04:00
Scott LaVarnway	2cf0d4be12	WIP: 32x32 idct/recon merge This patch eliminates the intermediate diff buffer usage by combining the short idct and the add residual into one function. The encoder can use the same code as well. Change-Id: I4ea09df0e162591e420d869b7431c2e7f89a8c1a	2013-05-14 15:54:17 -07:00
Dmitry Kovalev	7bbf716f04	Preparing vp9_deblock and vp9_denoise to alpha support. Change-Id: I299feefa64b93bd62263aea1ff1e41e85faeb6ca	2013-05-14 11:01:57 -07:00
John Koleszar	56efb73be3	Revert "Preparing vp9_deblock and vp9_denoise to alpha support." This reverts commit `a933311131` Change-Id: I2321f88011178381adbcffeda1bcc6a430ab8f1d	2013-05-14 06:46:11 -07:00
Dmitry Kovalev	a933311131	Preparing vp9_deblock and vp9_denoise to alpha support. Change-Id: Id1cc1c2663b9c2219cb830ffb4b0c6ab3468dc04	2013-05-13 14:03:29 -07:00
Dmitry Kovalev	4a559d3448	Merge "Removing unused simple loopfilter code." into experimental	2013-05-10 12:14:34 -07:00
Dmitry Kovalev	effaa3263d	Removing unused simple loopfilter code. Change-Id: Ic11dc052fb641687c015e1bbc37181b9babcd43e	2013-05-10 11:04:43 -07:00
Yunqing Wang	9f5811c2da	Add joint motion search in comp_inter_inter mode(experiment) In current code, motion vectors got from single prediction mode are used in compound prediction mode directly. These motion vectors may not give accurate prediction since they are searched independently. In this patch, we took Pascal's suggestion, and did joint motion search in compound prediction mode to find better motion vectors in this situation. Test results: Overall PSNR: 0.570%(derf), 0.918%(stdhd); SSIM: 0.572%(derf), 1.009%(stdhd); The encoder is a little slower. This can be improved since some c code is used in motion search. Change-Id: Ib30c9240f6c56c9b070867b4ca89412a76d9f3c6	2013-05-10 10:15:43 -07:00
Jingning Han	776c1482a3	Merge SB8X8 into the codebase Pull sb8x8 out of experimental list. verified via borg run tests. Fixed unit test failures. Change-Id: I12a4bbd17395930580c048ab68becad1ffe46e76	2013-05-07 09:08:25 -07:00
Ronald S. Bultje	f7fa367094	Fix first-pass intra4x4 for sb8x8 experiment. Change-Id: I1df17f45721c690d157800daa6a0b377e3d32bc2	2013-05-04 15:49:41 -07:00
Ronald S. Bultje	704fb4866e	Fix right-edge availability for intra prediction in sb8x8. Fixes valgrind uninitialized value use warnings. Change-Id: Ie9314d684e2ad194f8aca5bde1729fb9b7c0221d	2013-05-02 10:16:48 -07:00
Ronald S. Bultje	ff37688a91	Fix block reconstruction with sb8x8 enabled. The encoder reconstruction is now correct. Decoder to follow shortly. Change-Id: Iedf98cdaebb4ca1256c7714cad7024a75853ad6a	2013-05-01 19:28:17 -07:00
Ronald S. Bultje	b6c2d872f0	Fix some crashes in sb8x8 experiment. Change-Id: I390bb1cedc835f439fd5dd6cda6572b29cbb139c	2013-05-01 14:45:27 -07:00
John Koleszar	bb41ab4a0c	Remove BLOCKD structure All members can be referenced from their per-plane counterparts, and removes assumptions about 24 blocks per macroblock. Change-Id: I7ff2fa72d22c29163eb558981c8193765a8113d9	2013-04-26 10:35:54 -07:00
John Koleszar	4f55c5618a	Remove destination pointers from BLOCKD Access these members from MACROBLOCKD instead. Change-Id: I7907230dd473ff12ebe182b9280d8b7f12a888c4	2013-04-26 10:14:07 -07:00
Scott LaVarnway	57f180b388	Removed bmi from blockd This originally was "Removed update_blockd_bmi()". Now, this patch removed bmi from blockd and uses the bmi found in mode_info_context. Eliminates unnecessary bmi copies between blockd and mode_info_context. Change-Id: I287a4972974bb363f49e528daa9b2a2293f4bc76	2013-04-26 10:19:43 -04:00
John Koleszar	4bd0f4f646	Remove BLOCK structure All members can be referenced from their per-plane counterparts, and removes assumptions about 24 blocks per macroblock. Change-Id: I593fb0715e74cd84b48facd1c9b18c3ae1185d4b	2013-04-25 11:33:17 -07:00
Jingning Han	b42b41c856	Merge "Move sbsegment out of experimental list" into experimental	2013-04-25 09:18:01 -07:00
Scott LaVarnway	a426c7f343	Merge "Moved dequantization into the token decoder" into experimental	2013-04-25 08:53:42 -07:00
Jingning Han	b0e3b3df18	Move sbsegment out of experimental list Move rectangular superblock coding out of experimental list. Change-Id: I96c37547d122330d666a67b4bf577ae54547857f	2013-04-24 15:19:17 -07:00
John Koleszar	4f35e3e1c1	Merge "Move src_diff to per-plane MACROBLOCK data" into experimental	2013-04-23 16:24:08 -07:00
John Koleszar	cbd1315ac4	Move src_diff to per-plane MACROBLOCK data First in a series of commits making certain MACROBLOCK members addressable per-plane. This commit also refactors the block subtraction functions vp9_subtract_b, vp9_subtract_sby_c, etc to be loops-over-planes and variable subsampling aware. Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3	2013-04-23 12:18:51 -07:00
Deb Mukherjee	735febf1ce	Removing the implicit compound inter experiment Removing this experiment for now, since it has been broken with the latest code changes. Change-Id: I1be2181b56de490fcb577f5905b5e147a8ed82d8	2013-04-22 16:46:54 -07:00
Scott LaVarnway	e732bc298c	Moved dequantization into the token decoder Mostly for cleanup purposes. Now we should be able to rework the encoder/decoder to use a common idct/add function. Change-Id: I1597cc59812f362ecec0a3493b6101a6cc6fa7ff	2013-04-22 17:53:07 -04:00
John Koleszar	9ec0f658a1	Remove vp9_recon_mb{,y} Use the common sb functions instead. Change-Id: I4fa0a8ee3c6ada56271dd09bf895b97642f55858	2013-04-19 12:12:00 -07:00
John Koleszar	d747986d29	Remove redundant pointers from void vp9_recon_sb{y,uv} Remove the unnecessary _s_ from their names, and add a new vp9_recon_sb() that calls the y and uv variants. Change-Id: I7ffaa5ff5605a8472cac2a53de8cf889353039a6	2013-04-19 12:06:07 -07:00
Jingning Han	6f43ff5824	Make the use of pred buffers consistent in MB/SB Use in-place buffers (dst of MACROBLOCKD) for macroblock prediction. This makes the macroblock buffer handling consistent with those of superblock. Remove predictor buffer MACROBLOCKD. Change-Id: Id1bcd898961097b1e6230c10f0130753a59fc6df	2013-04-18 14:59:36 -07:00
Yaowu Xu	acfc5981c3	Merge "clean out experiments" into experimental	2013-04-17 14:53:00 -07:00
Yaowu Xu	421ad3f1b1	clean out experiments that are related to using reconstructed pixel for selecting reference motion vectors. Change-Id: I048dfae39ca7385e344b57d46347ecc6e753e1bb	2013-04-17 11:00:46 -07:00
Ronald S. Bultje	0c481f4d18	Add SSE2 versions for rectangular sad and sad4d functions. About 11% overall encoder speedup with the sbsegment experiment enabled. Change-Id: Iffb1bdba6932d9f11a6c791cda8697ccf9327183	2013-04-17 10:31:59 -07:00
Christian Duvivier	f13b69d07c	Faster vp9_short_fdct4x4 and vp9_short_fdct8x4. Scalar path is about 1.3x faster (2.1% overall encoder speedup). SSE2 path is about 5.0x faster (8.4% overall encoder speedup). Change-Id: I360d167b5ad6f387bba00406129323e2fe6e7dda	2013-04-16 16:11:56 -07:00
Scott LaVarnway	466f395148	Merge "Removing extra params from x_add_residual() functions" into experimental	2013-04-16 08:58:28 -07:00
Scott LaVarnway	6f95d53e37	Removing extra params from x_add_residual() functions Now that the predictor is the dest, we do not need the extra parameters. Change-Id: I31e2c3d2015f4a1cd12e7f04536d8db478582a0a	2013-04-16 09:59:01 -04:00
Scott LaVarnway	5393379c84	Merge "Removing extra params in dequant functions" into experimental	2013-04-16 06:37:00 -07:00
Jingning Han	aaf33d7df5	Add rectangular block size variance/sad functions. With this, the RD loop properly supports rectangular blocks. Change-Id: Iece79048fb4e84741ee1ada982da129a7bf00470	2013-04-15 13:39:07 -07:00
Scott LaVarnway	74610b1ae4	Removing extra params in dequant functions Now that the predictor is the dest, we do not need the extra parameters. Change-Id: I78db73d39b5aff62f15303f3d51ad2797eae74b6	2013-04-15 13:43:11 -04:00
Jingning Han	815e95fbeb	Make intra predictor support rectangular blocks The intra predictor supports configurable block sizes. It can handle intra prediction down to 4x4 sizes, when enabled in BLOCK_SIZE_TYPE. Change-Id: I7399ec2512393aa98aadda9813ca0c83e19af854	2013-04-11 16:45:57 -07:00
John Koleszar	2f19cd03aa	Merge "Remove unused vp9_recon_mb{y,uv}_s" into experimental	2013-04-11 15:51:20 -07:00
John Koleszar	c382ed09f8	Remove unused vp9_recon_mb{y,uv}_s These functions now are handled through the common superblock code. Change-Id: Ib6688971bae297896dcec42fae1d3c79af7a611c	2013-04-11 14:05:59 -07:00
Scott LaVarnway	6189f2bcb1	WIP: removing predictor buffer usage from decoder This patch will use the dest buffer instead of the predictor buffer. This will allow us in future commits to remove the extra mem copy that occurs in the dequant functions when eob == 0. We should also be able to remove extra params that are passed into the dequant functions. Change-Id: I7241bc1ab797a430418b1f3a95b5476db7455f6a	2013-04-11 13:55:18 -07:00
Ronald S. Bultje	b4f6098ef7	Make RD superblock mode search size-agnostic. Merge various super_block_yrd and super_block_uvrd versions into one common function that works for all sizes. Make transform size selection size-agnostic also. This fixes a slight bug in the intra UV superblock code where it used the wrong transform size for txsz > 8x8, and stores the txsz selection for superblocks properly (instead of forgetting it). Lastly, it removes the trellis search that was done for 16x16 intra predictors, since trellis is relatively expensive and should thus only be done after RD mode selection. Gives basically identical results on derf (+0.009%). Change-Id: If4485c6f0a0fe4038b3172f7a238477c35a6f8d3	2013-04-10 16:50:30 -07:00
Ronald S. Bultje	a3874850dd	Make SB coding size-independent. Merge sb32x32 and sb64x64 functions; allow for rectangular sizes. Code gives identical encoder results before and after. There are a few macros for rectangular block sizes under the sbsegment experiment; this experiment is not yet functional and should not yet be used. Change-Id: I71f93b5d2a1596e99a6f01f29c3f0a456694d728	2013-04-09 21:28:27 -07:00
John Koleszar	4c05a051ab	Move qcoeff, dqcoeff from BLOCKD to per-plane data Start grouping data per-plane, as part of refactoring to support additional planes, and chroma planes with other-than 4:2:0 subsampling. Change-Id: Idb76a0e23ab239180c818025bae1f36f1608bb23	2013-04-04 16:30:57 -07:00
Yunqing Wang	0e91bec4b5	Merge "Optimize 32x32 idct function" into experimental	2013-03-27 11:30:48 -07:00
Yunqing Wang	21a718d9a7	Optimize 32x32 idct function Wrote sse2 version of vp9_short_idct_32x32 function. Compared to c version, the sse2 version is 5X faster. Change-Id: I071ab7378358346ab4d9c6e2980f713c3c209864	2013-03-27 11:05:42 -07:00
Deb Mukherjee	23144d2345	Implicit weighted prediction experiment Adds an experiment to use a weighted prediction of two INTER predictors, where the weight is one of (1/4, 3/4), (3/8, 5/8), (1/2, 1/2), (5/8, 3/8) or (3/4, 1/4), and is chosen implicitly based on consistency of the predictors to the already reconstructed pixels to the top and left of the current macroblock or superblock. Currently the weighting is not applied to SPLITMV modes, which default to the usual (1/2, 1/2) weighting. However the code is in place controlled by a macro. The same weighting is used for Y and UV components, where the weight is derived from analyzing the Y component only. Results (over compound inter-intra experiment) derf: +0.18% yt: +0.34% hd: +0.49% stdhd: +0.23% The experiment suggests bigger benefit for explicitly signaled weights. Change-Id: I5438539ff4485c5752874cd1eb078ff14bf5235a	2013-03-26 16:58:56 -07:00
Yunqing Wang	869d6c0534	Optimize 16x16 idct10 function Wrote sse2 version of vp9_short_idct10_16x16 function. Compared to c version, the sse2 version is 2.3X faster. Change-Id: I314c4f09369648721798321eeed6f58e38857f26	2013-03-21 16:36:01 -07:00
Yunqing Wang	ec3100661c	Optimize 16x16 idct function Wrote sse2 version of vp9_short_idct16x16 function. Compared to c version, the sse2 version is over 2.5X faster. Change-Id: I38536e2b846427a2cc5c5423aaf305fd0e605d61	2013-03-21 11:44:05 -07:00
Yunqing Wang	6344c84c82	Optimize 8x8 idct function Wrote sse2 functions of vp9_short_idct8x8 and vp9_short_idct10_8x8. Compared to c version, the sse2 version is 2X faster. The decoder test didn't show noticeable gain since 8x8 idct doesn't take much of the decoding time (less than 1% in my test). Change-Id: I56313e18cd481700b3b52c4eda5ca204ca6365f3	2013-03-18 15:34:14 -07:00
Yaowu Xu	12ade55719	Merge "removed reference to "LLM" and "x8"" into experimental	2013-03-18 08:51:19 -07:00
Christian Duvivier	4418b790a7	Faster vp9_short_fdct16x16. Scalar path is about 1.5x faster (3.1% overall encoder speedup). SSE2 path is about 7.2x faster (7.8% overall encoder speedup). Change-Id: I06da5ad0cdae2488431eabf002b0d898d66d8289	2013-03-15 15:55:31 -07:00
Yaowu Xu	005552639b	removed reference to "LLM" and "x8" The commit changed the names of files and functions to remove obsolete references to LLM and x8. Change-Id: I973b20fc1a55149ed68b5408b3874768e6f88516	2013-03-13 08:35:46 -07:00
Yunqing Wang	11ca81f8b6	Add vp9_idct4_1d_sse2 Added SSE2 idct4_1d which is called by vp9_short_iht4x4. Also, modified the parameter type passed to vp9_short_iht functions to make it work with rtcd prototype. Change-Id: I81ba7cb4db6738f1923383b52a06deb760923ffe	2013-03-08 15:04:22 -08:00
Yunqing Wang	f240782650	Optimize add_constant_residual function Optimized adding constant diff to predictor, which gave about 2% decoder performance gain. Change-Id: I47db20c31428e8c4a8f16214a85cbe386a6e9303	2013-03-07 15:49:07 -08:00
Yunqing Wang	f4e383f3d1	Merge "Optimize add_residual function" into experimental	2013-03-05 16:47:58 -08:00
Yunqing Wang	943c6d7172	Optimize add_residual function Optimized adding diff to predictor, which gave 0.8% decoder performance gain. Change-Id: Ic920f0baa8cbd13a73fa77b7f9da83b58749f0f8	2013-03-05 16:27:45 -08:00
Ronald S. Bultje	4209bba462	Merge changes Ifacbf5a0,Ibad7c3dd into experimental * changes: vpxenc: actually report mismatch on stderr. Make superblocks independent of macroblock code and data.	2013-03-05 11:17:14 -08:00
Ronald S. Bultje	111ca42133	Make superblocks independent of macroblock code and data. Split macroblock and superblock tokenization and detokenization functions and coefficient-related data structs so that the bitstream layout and related code of superblock coefficients look less like a hack to fit macroblocks in superblocks. In addition, unify chroma transform size selection from luma transform size (i.e. always use the same size, as long as it fits the predictor); in practice, this means 32x32 and 64x64 superblocks using the 16x16 luma transform will now use the 16x16 (instead of the 8x8) chroma transform, and 64x64 superblocks using the 32x32 luma transform will now use the 32x32 (instead of the 16x16) chroma transform. Lastly, add a trellis optimize function for 32x32 transform blocks. HD gains about 0.3%, STDHD about 0.15% and derf about 0.1%. There are a few negative points here and there that I might want to analyze a little closer. Change-Id: Ibad7c3ddfe1acfc52771dfc27c03e9783e054430	2013-03-04 16:34:36 -08:00
Yunqing Wang	37932d9168	Merge "Optimize vp9_short_idct4x4llm function" into experimental	2013-03-04 14:13:31 -08:00
Yunqing Wang	e8bc9f4220	Optimize vp9_short_idct4x4llm function Wrote an SSE2 vp9_short_idct4x4llm to improve the decoder performance. Change-Id: I90b9d48c4bf37aaf47995bffe7e584e6d4a2c000	2013-03-04 12:01:27 -08:00
John Koleszar	1cfc86ebe0	Add unit test for x4 multi-SAD functions Update the function prototypes to match between VP9 and VP8. Change-Id: If58965073989e87df3b62b67a030ec6ce23ca04f	2013-03-01 18:14:02 -08:00
John Koleszar	69c67c9531	Merge master branch into experimental Picks up some build system changes, compiler warning fixes, etc. Change-Id: I2712f99e653502818a101a72696ad54018152d4e	2013-03-01 11:06:05 -08:00
Yunqing Wang	c550bb3b09	Add eob<=10 case in idct32x32 Simplified idct32x32 calculation when there are only 10 or fewer non-zero coefficients in a 32x32 block. This helps the decoder performance. Change-Id: If7f8893d27b64a9892b4b2621a37fdf4ac0c2a6d	2013-02-28 16:40:29 -08:00
Yunqing Wang	72b146690a	Merge "Refactor vp9_dequant_idct_add function" into experimental	2013-02-28 14:34:27 -08:00
Yunqing Wang	6193bc3ba8	Refactor vp9_dequant_idct_add function Provided a wrapper and removed duplicate code. Change-Id: Iaef842226ec348422e459202793b001d0983ea30	2013-02-28 14:18:46 -08:00
Scott LaVarnway	aa8fb070b8	Removed vp9_dequantize_b Change-Id: Ie89bd00d58e30bf4094cb748a282f1dfa81a31d8	2013-02-28 14:08:12 -08:00
Jim Bankoski	714aa9f3c0	this commit converts all sad ptrs to uint32 The sse4_1 code used uint16_t for returning sad, but that won't work for 32x32 or 64x64. This code fixes the assembly for those and also re-enables sse4_1 on Linux. Change-Id: I5ce7288d581db870a148e5f7c5092826f59edd81	2013-02-28 08:46:35 -08:00
Christian Duvivier	c129203f7e	Faster vp9_short_fdct8x8. Scalar path is about 1.4x faster (4% overall encoder speedup). SSE2 path is about 7x faster (13% overall encoder speedup). Change-Id: I7e85d8225a914a74c61ea370210414696560094d	2013-02-27 17:23:08 -08:00
John Koleszar	5ac141187a	Merge "Remove unused vp9_copy32xn" into experimental	2013-02-27 12:23:45 -08:00
John Koleszar	7ad8dbe417	Remove unused vp9_copy32xn This function was part of an optimization used in VP8 that required caching two macroblocks. This is unused in VP9, and might not survive refactoring to support superblocks, so removing it for now. Change-Id: I744e585206ccc1ef9a402665c33863fc9fb46f0d	2013-02-27 10:24:56 -08:00
Yunqing Wang	35bc02c6eb	Optimize vp9_dc_only_idct_add_c function Wrote SSE2 version of vp9_dc_only_idct_add_c function. In order to improve performance, clipped the absolute diff values to [0, 255]. This allowed us to keep the additions/subtractions in 8 bits. Tests showed a decoder performance increase of over 2%. Change-Id: Ie1a236d23d207e4ffcd1fc9f3d77462a9c7fe09d	2013-02-26 17:16:13 -08:00
Jingning Han	77a3becf92	clean up forward and inverse hybrid transform Rebased. Remove the old matrix multiplication transform computation. The 16x16 ADST/DCT can be switched on/off and evaluated by setting ACTIVE_HT16 300/0 in vp9/common/vp9_blockd.h. Change-Id: Icab2dbd18538987e1dc4e88c45abfc4cfc6e133f	2013-02-25 09:16:12 -08:00
James Zern	e5fb6321a1	give vp9 variance struct a unique name variance_vtable clashed with vp8/common/variance.h Change-Id: I09c1de44d5519f1bd13f58c01144c0de4706de6f	2013-02-22 16:25:13 -08:00
Jingning Han	babbd5d170	Forward butterfly hybrid transform This patch includes 4x4, 8x8, and 16x16 forward butterfly ADST/DCT hybrid transform. The kernel of 4x4 ADST is sin((2k+1)(n+1)/(2N+1)). The kernel of 8x8/16x16 ADST is of the form sin((2k+1)(2n+1)/4N). Change-Id: I8f1ab3843ce32eb287ab766f92e0611e1c5cb4c1	2013-02-21 18:24:28 -08:00
Ronald S. Bultje	35524e2231	Remove "eobs" array in MACROBLOCKD. The information is a duplicate of "eob" in BLOCKD. Change-Id: Ia6416273bd004611da801e4bfa6e2d328d6f02a3	2013-02-21 10:07:36 -08:00
Yaowu Xu	d262e26cc7	Merge lossless experiment Change-Id: I7b7b8d4fda3a23699e0c920d727f8c15d37d43aa	2013-02-20 07:54:28 -08:00
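
Commit 714aa9f3c0 above notes that a uint16_t SAD result "won't work for 32x32 or 64x64". The arithmetic behind that remark can be sketched as follows. This is an illustration only, not code from the repository: it assumes 8-bit pixels, the helper names (`max_sad`, `sad`, `sad_as_uint16`) are hypothetical, and the "keep the low 16 bits" truncation is an assumed model of what a 16-bit return would do, since the commit does not describe the exact failure mode of the sse4_1 assembly.

```python
UINT16_MAX = 0xFFFF  # largest value a 16-bit unsigned result can hold


def max_sad(width, height, max_pixel=255):
    """Worst-case sum of absolute differences for a block of 8-bit pixels."""
    return width * height * max_pixel


def sad(src, ref):
    """Plain sum of absolute differences over two equal-length pixel lists."""
    return sum(abs(s - r) for s, r in zip(src, ref))


def sad_as_uint16(src, ref):
    """Assumed model of a 16-bit return value: only the low 16 bits survive."""
    return sad(src, ref) & UINT16_MAX


if __name__ == "__main__":
    for size in (16, 32, 64):
        worst = max_sad(size, size)
        fits = worst <= UINT16_MAX
        print(f"{size}x{size}: worst-case SAD = {worst}, fits in uint16: {fits}")

    # Worst case for a 32x32 block: every pixel differs by 255.
    src = [255] * (32 * 32)
    ref = [0] * (32 * 32)
    print("true SAD:", sad(src, ref), "16-bit result:", sad_as_uint16(src, ref))
```

Under these assumptions a 16x16 block still fits (16 * 16 * 255 = 65280), while 32x32 and 64x64 blocks can exceed 65535, which is consistent with the commit widening the SAD outputs to 32 bits.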

300 Commits