Commit Graph

183 Commits

Author SHA1 Message Date
Jim Bankoski
7c15c18c5e removed the recon rtcd invoke macro code (unrevert)
This reinstates reverted commit 2113a83157

Change-Id: I9a9af13497d1e58d4f467e3e083fddf06b1b786c
2012-10-16 12:02:31 -07:00
Yunqing Wang
64075c9b01 Encoder denoiser performance improvement
The denoiser function was modified to reduce the computational
complexity.

1. The denoiser c function modification:
The original implementation calculated pixel's filter_coefficient
based on the pixel value difference between current raw frame and last
denoised raw frame, and stored them in lookup tables. For each pixel c,
find its coefficient using
    filter_coefficient[c] = LUT[abs_diff[c]];
and then apply filtering operation for the pixel.

The denoising filter costed about 12% of encoding time when it was
turned on, and half of the time was spent on finding coefficients in
lookup tables. In order to simplify the process, a short cut was taken.
The pixel adjustments vs. pixel diff value were calculated ahead of time.
    adjustment = filtered_value - current_raw
               = (filter_coefficient * diff + 128) >> 8

The adjustment vs. diff curve becomes flat very quick when diff increases.
This allowed us to use only several levels to get a close approximation
of the curve. Following the denoiser algorithm, the adjustments are
further modified according to how big the motion magnitude is.

2. The sse2 function was rewritten.

This change made denoiser filter function 3x faster, and improved the
encoder performance by 7% ~ 10% with the denoiser on.

Change-Id: I93a4308963b8e80c7307f96ffa8b8c667425bf50
2012-08-31 13:48:13 -07:00
Deb Mukherjee
7d0656537b Merging in the sixteenth subpel uv experiment
Merges this experiment in to make it easier to run tests on
filter precision, vectorized implementation etc.

Also removes an experimental filter.

Change-Id: I1e8706bb6d4fc469815123939e9c6e0b5ae945cd
2012-08-08 16:57:43 -07:00
John Koleszar
c6b9039fd9 Restyle code
Approximate the Google style guide[1] so that that there's a written
document to follow and tools to check compliance[2].

[1]: http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
[2]: http://google-styleguide.googlecode.com/svn/trunk/cpplint/cpplint.py

Change-Id: Idf40e3d8dddcc72150f6af127b13e5dab838685f
2012-07-17 11:46:03 -07:00
John Koleszar
0164a1cc5b Fix pedantic compiler warnings
Allows building the library with the gcc -pedantic option, for improved
portabilty. In particular, this commit removes usage of C99/C++ style
single-line comments and dynamic struct initializers. This is a
continuation of the work done in commit 97b766a46, which removed most
of these warnings for decode only builds.

Change-Id: Id453d9c1d9f44cc0381b10c3869fabb0184d5966
2012-06-11 15:14:58 -07:00
Stefan Holmer
cd0bf0e407 Fixes a win build issue related to denoising.
Change-Id: I912384f526865089aa03ca8875591324e5c1c449
2012-05-31 15:44:28 +02:00
Stefan Holmer
0927a41139 Fixes a clang linking error.
Change-Id: I1d2db53129dc6ec068093ad1e5fc0d94110473b3
2012-05-31 10:52:20 +02:00
Stefan Holmer
d850034443 Added another denoising threshold for finding DC shifts.
Compares the sum of differences between the input block and the averaged
block. If they differ too much the block will not be filtered. Negligible
perfomance hit.

Change-Id: Ib1c31a265efd4d100b3abc4a1ea6675038c8ddde
2012-05-30 16:50:21 +02:00
Alpha Lam
0f7e4665ae Make libvpx Chromium build friendly
Add PRIVATE macro for adding private_extern directive for yasm
to hide global symbols. This is only enabled if -DCHROMIUM is used
with YASM.

Also fixed a small problem with	rtcd_defs.sh to guard TEMPORAL_DENOISING.

Change-Id: I9027fce3ebddcf20078293e4b86b396f21da7857
2012-05-23 18:15:05 -07:00
Christian Duvivier
38ddb426d0 Inline Intrinsic optimized Denoiser
Faster version of denoiser, cut cost by 1.7x for C path, by 3.3x for
SSE2 path.

Change-Id: I154786308550763bc0e3497e5fa5bfd1ce651beb
2012-05-21 07:54:20 -07:00
Attila Nagy
a91b42f022 Makes all global data in entropy.c const
Removes all runtime initialization of global data in entropy.c.
Precalculated values are used for initializing all entropy related
tabels.

First patch in a series to make sure code is reentrant.

Change-Id: I9aac91a2a26f96d73c6470d772a343df63bfe633
2012-04-17 12:12:58 +03:00
Paul Wilkins
c88d335f7d Only support improved quant
Deprecate fast quant and strict_quant code.
Small effect on quality as fast was used in first pass but the
effect is basically neutral across the derf set.

The rationale here is to reduce the number of code paths for
now to make experimentation easier. Optimized and fast code
options can be re-introduced later along with other  encode
speed options.

Change-Id: Ia30c5daf3dbc52e72c83b277a1d281e3c934cdad
2012-03-21 18:22:33 +00:00
Johann
e50f96a4a3 Move SAD and variance functions to common
The MFQE function of the postprocessor depends on these

Change-Id: I256a37c6de079fe92ce744b1f11e16526d06b50a
2012-03-05 16:50:33 -08:00
Deb Mukherjee
88b36eb0d9 Bug fix in ssse3 variance computation.
Fixes a bug that was introduced in the high precision mv patch.

Change-Id: Ieadb433ebe4c3ef3e0e63944dab11528bf8bd73a
2012-02-24 20:24:54 -08:00
Deb Mukherjee
18e90d744e Supporting high precision 1/8-pel motion vectors
This is the initial patch for supporting 1/8th pel
motion. Currently if we configure with enable-high-precision-mv,
all motion vectors would default to 1/8 pel. Encode and
decode syncs fine with the current code. In the next phase
the code will be refactored so that we can choose the 1/8
pel mode adaptively at a frame/segment/mb level.

Derf results:
http://www.corp.google.com/~debargha/vp8_results/enhinterp_hpmv.html
(about 0.83% better than 8-tap interpoaltion)

Patch 3: Rebased. Also adding 1/16th pel interpolation for U and V

Patch 4: HD results.
http://www.corp.google.com/~debargha/vp8_results/enhinterp_hd_hpmv.html
Seems impressive (unless I am doing something wrong).

Patch 5: Added mmx/sse for bilateral filtering, as well as enforced
use of c-versions of subpel filters with 8-taps and 1/16th pel;
Also redesigned the 8-tap filters to reduce the cut-off in order to
introduce a denoising effect. There is a new configure option
sixteenth-subpel-uv which will use 1/16 th pel interpolation for
uv, if the motion vectors have 1/8 pel accuracy.

With the fixes the results are promising on the derf set. The enhanced
interpolation option with 8-taps alone gives 3% improvement over thei
derf set:
http://www.corp.google.com/~debargha/vp8_results/enhinterpn.html

Results on high precision mv and on the hd set are to follow.

Patch 6: Adding a missing condition for CONFIG_SIXTEENTH_SUBPEL_UV in
vp8/common/x86/x86_systemdependent.c

Patch 7: Cleaning up various debug messages.

Patch 8: Merge conflict

Change-Id: I5b1d844457aefd7414a9e4e0e06c6ed38fd8cc04
2012-02-23 09:25:21 -08:00
Johann
6b151d436d Clarify 'max_sad' usage
Depending on implementation the optimized SAD functions may return early
when the calculated SAD exceeds max_sad.

Change-Id: I05ce5b2d34e6d45fb3ec2a450aa99c4f3343bf3a
2012-02-16 15:17:44 -08:00
Paul Wilkins
79d330d7d5 Code simplification
Removal of the pickinter.c and .h files and calls to this
code.

Removal of some code relating to real time and one pass
settings  though there is more to be done in this regard.

However,  vp8_set_speed_features() now
only supports modes 0 and 1 and speeds up to 3
so rd should always be set.

Change-Id: I62c0c1b6154ab499785baef310536080e87bc4d8
2012-02-16 17:21:20 +00:00
Paul Wilkins
9a8204d6ee Simplification of experimental code base.
Removed ~CONFIG_REALTIME_ONLY code.

Change-Id: I5fafff29a08acd8928699f9ddce8744787024d8c
2012-02-14 09:03:56 +00:00
Johann
169823428f Missed some variance casts
Change-Id: I9fb510f9421fb3c317a8e32e3058cee977ddf9fa
2012-02-10 11:07:33 -08:00
Johann
fea3556e20 Fix variance overflow
In the variance calculations the difference is summed and later squared.
When the sum exceeds sqrt(2^31) the value is treated as a negative when
it is shifted which gives incorrect results.

To fix this we cast the result of the multiplication as unsigned.

The alternative fix is to shift sum down by 4 before multiplying.
However that will reduce precision.

For 16x16 blocks the maximum sum is 65280 and sqrt(2^31) is 46340 (and
change).

PPC change is untested.

Change-Id: I1bad27ea0720067def6d71a6da5f789508cec265
2012-02-09 12:38:31 -08:00
Paul Wilkins
d90f0eb4c5 Removal of SEGFEATURES placeholder comments
This commit only involves the removal of placeholder comments
//#if CONFIG_SEGFEATURES.

Change-Id: I94b350daaf998ee0cfdde5aa25b1d3b0522ab816
2012-02-09 17:25:05 +00:00
John Koleszar
8aae246089 RTCD: finalize removal of old RTCD system
This is the final commit in the series converting to the new RTCD
system. It removes the encoder csystemdependent files and the remaining
global function pointers that didn't conform to the old RTCD system.

Change-Id: I9649706f1bb89f0cbf431ab0e3e7552d37be4d8e
2012-01-30 12:10:48 -08:00
John Koleszar
109b69a706 RTCD: add arnr functions
This commit continues the process of converting to the new RTCD
system. It removes the last of the VP8_ENCODER_RTCD struct references.

Change-Id: I2a44f52d7cccf5177e1ca98a028ead570d045395
2012-01-30 12:10:48 -08:00
John Koleszar
0b0bc8d098 RTCD: add motion search functions
This commit continues the process of converting to the new RTCD
system.

Change-Id: Ia5828b7ecc80db55b21916704aa3d54cbb98f625
2012-01-30 12:10:47 -08:00
John Koleszar
be8af188d0 RTCD: add block subtraction functions
This commit continues the process of converting to the new RTCD
system.

Change-Id: Id8a287fdd4bd050ea4452e1582ad85520f3081be
2012-01-30 12:10:47 -08:00
John Koleszar
61311e6103 RTCD: add quantizer functions
This commit continues the process of converting to the new RTCD
system.

Change-Id: Iba9df4c03a508e51c37201c621be43523fae87d9
2012-01-30 12:10:46 -08:00
John Koleszar
510e0ab467 RTCD: add FDCT functions
This commit continues the process of converting to the new RTCD
system.

Change-Id: I3f9c07db65eb206f6363d21bdb80e871570da767
2012-01-30 12:10:42 -08:00
John Koleszar
83a91e789c RTCD: add variance functions
This commit continues the process of converting to the new RTCD
system.

Change-Id: Ie5c1aa480637e98dc3918fb562ff45c37a66c538
2012-01-30 12:08:30 -08:00
Yunqing Wang
2b2c0c9bda Improve SSSE3 fast quantizer function
Simplified the EOB calculation in the function.

Change-Id: I7422f18be40ae270358f5cb0811d66e64436b56f
2011-12-29 12:05:50 -05:00
Johann
f2cd4ded22 Move shared data to shared location
Storing vp8_bilinear_filters_mmx in an mmx file and using it in an sse2
file is bad

Moving towards allowing --disable-mmx

Change-Id: I20493b35bdedcdcfc0915e6f05fdbe6c81a4a742
2011-11-18 16:23:14 -08:00
Scott LaVarnway
edd98b7310 Added predictor stride argument(s) to subtract functions
Patch set 2: 64 bit build fix
Patch set 3: 64 bit crash fix

[Tero]
Patch set 4: Updated ARMv6 and NEON assembly.
             Added also minor NEON optimizations to subtract
             functions.

Patch set 5: x86 stride bug fix

Change-Id: I1fcca93e90c89b89ddc204e1c18f208682675c15
2011-11-15 12:53:01 -05:00
Paul Wilkins
a10a268e58 Segment Features. Removal of #ifdefs
Removal of configure #ifdefs so that segment features
always available. Removal of code supporting old
segment feature method.

Still a good deal of tidying up to do.

Change-Id: I397855f086f8c09ab1fae0a5f65d9e06d2e3e39f
2011-11-03 17:14:26 +00:00
Tero Rintaluoma
e4f2ec7a52 Change use of eob in the encoder
Changed 'int eob' to 'char *eob' in BLOCKD so that both encoder and
decoder will use eobs[25] array from MACROBLOCKD structure. In future,
this will enable use of the decoder side IDCT in the encoder.

Change-Id: I6e1c011628cb8864fd4a0b80f0279ce16a5ca978
2011-11-03 16:08:09 +02:00
Paul Wilkins
01ce04bc06 Further segment feature extensions.
This quite large check in includes the following:

Merge in some code from Ronald (mbgraph.c) that scans a Gf/arf group.
This is used as a basis for a simple segmentation for the normal frames
in a gf/arf group. This code also uses satd functions from Yaowu.

Adds functionality for coding the latest possible position of an EOB for
blocks in the segment. (Currently 0-15 only, hence just for 4x4 dct).
Where the EOB position is 0 this acts like "skip" and the normal coding
of skip at the per mb level is disabled.

Added functions (seg_common.c) for setting and reading segment feature
elements. These may want to be optimized away at some point but while the
mecahnism is in a state of flux they provide a single location for making
changes and keep things a bit cleaner.

This is still proof of concept code. Currently the tested feature set:-

Quantizer,
Loop Filter level,
Reference frame,
Prediction Mode,
EOB end stop.

TBD:-

Add functions for setting and reading the feature data with range
and validity checking.

Handling of signed and unsigned feature data. At the moment all is assumed
to be signed and a sign bit is coded but many cannot be negative.

Correct handling of EOB feature with intra coded blocks.

Testing/trapping of legal/illegal ref frame and mode combinations.

Transform size switch plus merge and test with 8c8 DCT work

Merge and test with Sumans Segmenation coding optimizations

Change-Id: Iee12e83661c7abbd1e0ce6810915eb4ec35e2d8e
2011-10-24 15:52:18 +01:00
Attila Nagy
1a7d25a484 Replace vpx_ports/config.h with vpx_config.h
Just a clean-up.

Change-Id: Iea5b6dc925dcfa7db548bc1ab1a13d26ed5a2c9a
2011-09-22 13:33:54 +03:00
Fritz Koenig
c5f890af2c Use local labels for jumps/loops in x86 assembly.
Prepend . to local labels in assembly code.  This
allows non unique labels within a file.  Also
makes profiling information more informative
by keeping the function name with the loop name.

Change-Id: I7a983cb3a5ba2413d5dafd0a37936b268fb9e37f
2011-08-23 09:05:29 -07:00
Fritz Koenig
694d4e7777 Reclassify optimized ssim calculations as SSE2.
Calculations were incorrectly classified as either
SSE3 or SSSE3.  Only using SSE2 instructions.
Cleanup function names and make non-RTCD code work
as well.

Change-Id: I48ad0218af0cc51c5078070a08511dee43ecfe09
2011-08-22 12:36:28 -07:00
Fritz Koenig
734b1b2041 Revert "Reclasify optimized ssim calculations as SSE2."
This reverts commit 01376858cd
2011-08-22 11:31:12 -07:00
Fritz Koenig
01376858cd Reclasify optimized ssim calculations as SSE2.
Calculations were incorrectly classified as either
SSE3 or SSSE3.  Only using SSE2 instructions.
Cleanup function names and make non-RTCD code work
as well.

Change-Id: I29f5c2ead342b2086a468029c15e2c1d948b5d97
2011-08-19 08:51:27 -07:00
Yunqing Wang
fe270dd527 Specify size for argument pushed to stack
The change fixes building error on Win64.

Change-Id: I63d25b26220c4da8a98ca2e36530cbb802468e6b
2011-07-25 11:30:45 -04:00
Yunqing Wang
20bd1446c0 Preload reference area to an intermediate buffer in sub-pixel motion search
In sub-pixel motion search, the search range is small(+/- 3 pixels).
Preload whole search area from reference buffer into a 32-byte
aligned buffer. Then in search, load reference data from this buffer
instead. This keeps data in cache, and reduces the crossing cache-
line penalty. For tulip clip, tests on Intel Core2 Quad machine(linux)
showed encoder speed improvement:
  3.4%   at --rt --cpu-used =-4
  2.8%   at --rt --cpu-used =-3
  2.3%   at --rt --cpu-used =-2
  2.2%   at --rt --cpu-used =-1

Test on Atom notebook showed only 1.1% speed improvement(speed=-4).
Test on Xeon machine also showed less improvement, since unaligned
data access latency is greatly reduced in newer cores.

Next, I will apply similar idea to other 2 sub-pixel search functions
for encoding speed > 4.

Make this change exclusively for x86 platforms.

Change-Id: Ia7bb9f56169eac0f01009fe2b2f2ab5b61d2eb2f
2011-07-22 09:28:06 -04:00
Johann
92b0e544f3 fix --disable-runtime-cpu-detect on x86
Change-Id: Ib8e429152c9a8b6032be22b5faac802aa8224caa
2011-06-14 11:31:50 -04:00
Yaowu Xu
361717d2be remove one set of 16x16 variance funcations
call to this set of functions are replaced by var16x16.

Change-Id: I5ff1effc6c1358ea06cda1517b88ec28ef551b0d
2011-06-09 11:23:05 -07:00
Yaowu Xu
d4700731ca remove redundant functions
The encoder defined about 4 set of similar functions to calculate sum,
variance or sse or a combination of them. This commit removed one set
of these functions, get8x8var and get16x16var, where calls to the later
function are replaced with var16x16 by using the fact on a 16x16 MB:
    variance == sse - sum*sum/256

Change-Id: I803eabd1fb3ab177780a40338cbd596dffaed267
2011-06-06 16:44:05 -07:00
Yunqing Wang
b6679879b8 Return sse value in vp8_variance SSE2 functions
Minor modification.

Change-Id: I09511d38fd1451d5c4106a48acdb3f766ce59cb7
2011-05-25 11:55:41 -04:00
John Koleszar
c684d5e5f2 Merge "changed configure option name to reduce confusion" 2011-05-19 11:17:08 -07:00
Yunqing Wang
c7a56f677d Merge "Use diamond search to replace full search in full-pixel refining search" 2011-05-10 06:59:38 -07:00
Yunqing Wang
cb7b1fb144 Use diamond search to replace full search in full-pixel refining search
In NEWMV mode, currently, full search is used as the refining search
after n-step search. By replacing it with an iterative diamond search
of radius 1 largely reduced the computation complexity, but still
maintained the same encoding quality since the refining search is
done for every macroblock instead of only a small precentage of
macroblocks while using full search.

Tests on the test set showed a 3.4% encoding speed increase with none
psnr & ssim loss.

Change-Id: Ife907d7eb9544d15c34f17dc6e4cfd97cb743d41
2011-05-09 14:07:06 -04:00
Johann
a7d4d3c550 clean up unused variable warnings
Change-Id: I9467d7a50eac32d8e8f3a2f26db818e47c93c94b
2011-05-09 12:56:20 -04:00
Yaowu Xu
57ad189129 changed configure option name to reduce confusion
Renamed configure option "enable-psnr" to "enable-internal-stats" to
better reflect the purpose of the option and eliminate the confusion
reported in http://code.google.com/p/webm/issues/detail?id=35

Change-Id: If72df6fdb9f1e33dab1329240ba4d8911d2f1f7a
2011-04-29 09:39:05 -07:00
Johann
aeca599087 Merge "keep values in registers during quantization" 2011-04-25 06:52:38 -07:00
Ronald S. Bultje
496bcbb0de Fix overflow in temporal_filter_apply_sse2().
The accumulator array is an integer array, so use paddd instead of paddw
to add values to it. Fixes overflows when using large --arnr-maxframes
(>8) values.

Change-Id: Iad83794caa02400a65f3ab5760f2517e082d66ae
2011-04-22 10:00:38 -04:00
Johann
508ae1b3d5 keep values in registers during quantization
add an sse4 quantizer so we can use pinsrw/pextrw and keep values in xmm
registers instead of proxying through the stack. and as long as we're
bumping up, use some ssse3 instructions in the EOB detection (see ssse3
fast quantizer)
pick up about a percent on 32bit and about two on 64bit.

Change-Id: If15abba0e8b037a1d231c0edf33501545c9d9363
2011-04-21 15:47:55 -04:00
Johann
4a2b684ef4 modify SAVE_XMM for potential 64bit use
the win64 abi requires saving and restoring xmm6:xmm15. currently
SAVE_XMM and RESTORE XMM only allow for saving xmm6:xmm7. allow
specifying the highest register used and if the stack is unaligned.

Change-Id: Ica5699622ffe3346d3a486f48eef0206c51cf867
2011-04-19 10:42:45 -04:00
Johann
a9b465c5c9 Merge "Add save/restore xmm registers in x86 assembly code" 2011-04-19 06:32:10 -07:00
Johann
c7cfde42a9 Add save/restore xmm registers in x86 assembly code
Went through the code and fixed it. Verified on Windows.

Where possible, remove dependencies on xmm[67]

Current code relies on pushing rbp to the stack to get 16 byte
alignment. This broke when rbp wasn't pushed
(vp8/encoder/x86/sad_sse3.asm). Work around this by using unaligned
memory accesses. Revisit this and the offsets in
vp8/encoder/x86/sad_sse3.asm in another change to SAVE_XMM.

Change-Id: I5f940994d3ebfd977c3d68446cef20fd78b07877
2011-04-18 16:30:38 -04:00
Johann
cd103a5721 Merge "store quant_shift as an unsigned char" 2011-04-18 10:03:40 -07:00
Yaowu Xu
c619f6cb0f Merge "fixed an overflow in ssim calculation" 2011-04-18 07:44:34 -07:00
Johann
70f30aa95d store quant_shift as an unsigned char
in encodframe.c, quant_shift is set to 0 or 1 in vp8cx_invert_quant

only use 8 bits to store this, instead of 16. will allow saving an
xmm register in an updated version of the regular quantize

Change-Id: Ie88c47fe2aff5af0283dab1147fb2791e4b12f90
2011-04-13 13:50:12 -04:00
Yunqing Wang
4fd81a99f8 Set cpu_used range to [-16, 16] in real-time mode
Remove encoding speed limitation in real-time mode.

Change-Id: Ib5e35d8bb522b2a25f3e4ad5cfe2788ebebb3617
2011-04-11 15:55:04 -04:00
Jim Bankoski
d4cdb683a4 fixed an overflow in ssim calculation
This commit fixed an overflow in ssim calculation, added register
save and restore to make sure assembly code working for x64 platform.
It also changed the sampling points to every 4x4 instead of 8x8 and
adjusted the constants in SSIM calculation to match the scale of
previous VPXSSIM.

Change-Id: Ia4dbb8c69eac55812f4662c88ab4653b6720537b
2011-04-07 14:25:25 -07:00
Johann Koenig
08702002e8 use asm_offsets with vp8_fast_quantize_b_sse3
on the same order as the sse2 fast quantize change: ~2%
except for 32bit. only a slight improvment there.

Change-Id: Iff80e5f1ce7e646eebfdc8871405458ff911986b
2011-04-07 16:40:05 -04:00
James Berry
aec5487cdd Use correct 32 bit comparisons for SAD breakout.
Rax updated to eax to avoid uninitialized memory
usage.

Change-Id: Iedb953f104329ede2a786fc648a47f1be2f3798a
2011-04-07 15:08:03 -04:00
Johann
c32e0ecc59 use asm_offsets with vp8_fast_quantize_b_sse2
on the same order as the regular quantize change: ~2%

Change-Id: I5c9eec18e89ae7345dd96945cb740e6f349cee86
2011-04-04 16:23:29 -04:00
Johann
8520b5c785 tweak vp8_regular_quantize_b_sse2
rather than look up rc in the zig zag table, embed it in the macro. this
also allows us to shuffle some values in the macro and keep *d in rsi

gains of about the same order as the obj_int_extract implementation: ~2%

Change-Id: Ib7252dd10eee66e0af8b0e567426122781dc053d
2011-04-01 09:58:23 -04:00
Yunqing Wang
534ea700bd Merge "Fix a crash while enabling shared (--enable-shared)" 2011-03-29 09:04:22 -07:00
Yunqing Wang
b843aa4eda Fix a crash while enabling shared (--enable-shared)
Fixed a bug in SSSE3 sub-pixel filter functions.

Change-Id: I2e2126652970eb78307ffcefcace1efd5966fb0a
2011-03-29 11:31:06 -04:00
Johann
f0c22a3f33 use GLOBAL correctly on 32bit shared libraries
http://code.google.com/p/webm/issues/detail?id=309

Change-Id: I6fce9e2f74bc09a9f258df7f91ab599812324e8c
2011-03-29 11:27:03 -04:00
Johann
8edaf6e2f2 use asm_offsets with vp8_regular_quantize_b_sse2
remove helper function and avoid shadowing all the arguments to the
stack on 64bit systems

when running with --good --cpu-used=0:
~2% on linux x86 and x86_64
~2% on win32 x86 msys and visual studio
more on darwin10 x86_64
significantly more on
x86_64-win64-vs9

Change-Id: Ib7be12edf511fbf2922f191afd5b33b19a0c4ae6
2011-03-24 13:34:48 -04:00
John Koleszar
2cbd962088 Remove unused vp8_get4x4sse_cs_mmx declaration
This declaration did not match the prototype_sad() prototype, but was
unused in this translation unit, so it is removed instead. Fixes
issue 290.

Change-Id: I168854f88a85f73ca9aaf61d1e5dc0f43fc3fdb3
2011-03-21 07:53:53 -04:00
John Koleszar
429dc676b1 Increase static linkage, remove unused functions
A large number of functions were defined with external linkage, even
though they were only used from within one file. This patch changes
their linkage to static and removes the vp8_ prefix from their names,
which should make it more obvious to the reader that the function is
contained within the current translation unit. Functions that were
not referenced were removed.

These symbols were identified by:

  $ nm -A libvpx.a | sort -k3 | uniq -c -f2 | grep ' [A-Z] ' \
    | sort | grep '^ *1 '

Change-Id: I59609f58ab65312012c047036ae1e0634f795779
2011-03-17 20:53:47 -04:00
Yunqing Wang
3c9dd6c3ef Merge "Align SAD output array to be 16-byte aligned" 2011-03-11 05:56:02 -08:00
Jim Bankoski
3f6f7289aa vp8cx- alternate ssim function with optimizations
Change-Id: I91921b0a90dbaddc7010380b038955be347964b3
2011-03-11 08:51:21 -05:00
Yunqing Wang
b2aa401776 Align SAD output array to be 16-byte aligned
Use aligned store.

Change-Id: Icab4c0c53da811d0c52bb7e8134927f249ba2499
2011-03-11 08:24:23 -05:00
Yunqing Wang
7b8e7f0f3a Add vp8_sub_pixel_variance16x8_ssse3 function
Added SSSE3 function

Change-Id: I8c304c92458618d93fda3a2f62bd09ccb63e75ad
2011-03-09 12:33:21 -05:00
Yunqing Wang
4561109a69 Remove unused functions
Removed some unused functions

Change-Id: Ifdfc27453e53cfc75997b38492901d193a16b245
2011-03-09 10:45:03 -05:00
Yunqing Wang
419f638910 Improve SSE2 half-pixel filter funtions
Rewrote these functions to process 16 pixels once instead of 8.

Change-Id: Ic67e80124467a446a3df4cfecfb76a4248602adb
2011-03-08 16:25:06 -05:00
Yunqing Wang
8432a1729f Add zero offset checking in SSE2 sub-pixel filter function
Skip filter at zero offset.

Change-Id: I95fc7e211869bc0ab5bcfb7ab2e3259d1c0ccf38
2011-03-08 15:22:07 -05:00
Yunqing Wang
244e2e1451 Write SSSE3 sub-pixel filter function
1. Process 16 pixels at one time instead of 8.
2. Add check for both xoffset =0 and yoffset=0, which happens
   during motion search.
This change gave encoder 1%~3% performance gain.

Change-Id: Idaa39506b48f4f8b2fbbeb45aae8226fa32afb3e
2011-03-08 13:29:01 -05:00
Yunqing Wang
cfaee9f7c6 Merge "Add prefetch before variance calculation" 2011-02-28 11:42:28 -08:00
Yunqing Wang
d96ba65a23 Add prefetch before variance calculation
This improved encoding performance by 0.5% (good, speed 1) to
1.5% (good, speed 5).

Change-Id: I843d72a0d68a90b5f694adf770943e4a4618f50e
2011-02-28 11:25:55 -05:00
Attila Nagy
7af0d906e3 Remove temporal alt ref from realtime only build
It is not used in realtime mode. Reduces memory footprint.

Change-Id: I7f163225762368df5457cfd413050161d3704a3f
2011-02-22 12:53:32 +02:00
Johann
945dad277d Revert "use unaligned load"
This reverts commit f50f2fd2a7.

Change Ib7506e3e aligns the buffer

Change-Id: Ie0f8bd3e57cfdfef81d39638a1451458ebbae2e0
2011-02-18 10:23:02 -05:00
John Koleszar
c351aa7f1b Merge "Fix relative include paths" 2011-02-17 04:13:44 -08:00
Yunqing Wang
2debd5b5f7 Improve vp8_sad16x16_sse3 function
In real-time mode, vp8_sad16x16 function is called heavily in
motion search part. Improvement of this function gives 1.2%
encoding performance gain (real-time mode, tulip clip).

Change-Id: I23c401fc40c061f732a9767e8d383737a179bd58
2011-02-14 16:23:49 -05:00
John Koleszar
02321de0f2 Fix relative include paths
Allow compiling without adding vp8/{common,encoder,decoder} to the
include paths.

Change-Id: Ifeb5dac351cdfadcd659736f5158b315a0030b6c
2011-02-10 15:09:44 -05:00
Johann
907e98fbb5 Merge "update sse2 regular quantizer" 2011-01-25 13:40:28 -08:00
Yunqing Wang
0822a62f40 Modify sub-pixel filters to eliminate unnecessary calculations
In sub-pixel calculation, xoffset and yoffset mostly take some
specific values. Modified sub-pixel filter functions according to
these possible values to improve performance.

Change-Id: I83083570af8b00ff65093467914fbb97a4e9ea21
2011-01-21 13:59:27 -05:00
Attila Nagy
cb791aaa2f Fix encoder real-time only configuration.
Remove allocation/deallocation of stats storage.
Remove full search functions in machine specific encoder inits.
Remove last pass validation in  validate_config.

Change-Id: I7f29be69273981a4fef6e80ecdb6217c68cbad4e
2011-01-18 08:19:21 -05:00
Johann
15f9bea73b update sse2 regular quantizer
about ~5% gain on 32bit. disabled for 64bit

unset executable bit on ssse3 version (cosmetic)

Change-Id: I1a5860839eb294ce4261f819caea2dcfa78e57ca
2011-01-14 14:26:10 -05:00
Johann
f50f2fd2a7 use unaligned load
source buffer is not guaranteed to be aligned for odd size buffers

Change-Id: Id0b1fd40ba3bd6c994bcfada788feccd2b53c5a9
2011-01-11 11:22:29 -05:00
Johann
8b0cf5f79d x86 sse2 temporal_filter_apply
count can be reduced to short because the max number of filtered frames
is set to 15. the max value for any frame is 32 (modifier = 16,
filter_weight = 2). 15*32 = 480 which requires 9 bits

this function goes from about 7000 us / 1000 iterations for the C code
to < 275 us / 1000 iterations for sse2 for block_size = 16 and from
about 1800 us / 1000 iters to < 100 us / 1000 iters for block_size = 8

Change-Id: I64a32607f58a2d33c39286f468b04ccd457d9e6e
2011-01-06 14:00:30 -05:00
Scott LaVarnway
516ea8460b Use the fast quantizer for inter mode selection
Use the fast quantizer for inter mode selection and the
regular quantizer for the rest of the encode for good quality,
speed 1.  Both performance and quality were improved.  The
quality gains will make up for the quality loss mentioned in
I9dc089007ca08129fb6c11fe7692777ebb8647b0.

Change-Id: Ia90bc9cf326a7c65d60d31fa32f6465ab6984d21
2010-12-28 14:51:46 -05:00
John Koleszar
b1aa54ab26 remove unused temporal preproc code
This code is unused, as the current preproc implementation uses the
same spatial filter that postproc uses.

Change-Id: Ia06d5664917d67283f279e2480016bebed602ea7
2010-12-13 16:47:59 -05:00
Fritz Koenig
e0cf330cde vp8 fast quantizer sse2 optimizations for eob.
Changed the end of block computation to use pmaxw.  Removed
additional pushing and popping of registers that was not needed.

Change-Id: I08cb9b424513cd8a2c7ad8cea53b4e2adc66ef98
2010-12-09 15:00:30 -08:00
Fritz Koenig
e180255375 Remove stack shadowing for x86-x64 for SAD functions.
x86-64 passes arguments in registers.  There is no need to push
them to the stack before using them.

This fixes 15acc84f10 where ebx
was not getting preserved on x86.

Change-Id: I1214b5f818a0201f75ab6ad7d5c6f448e09b16c2
2010-11-15 10:56:02 -08:00
Fritz Koenig
58083cb34d Revert "Remove stack shadowing for x86-64"
This reverts commit 15acc84f10.

Change-Id: Ia640be8cbc134432914849c1750f62575ea084e6
2010-11-11 08:20:02 -08:00
Fritz Koenig
9b1ece2cca Merge "Remove stack shadowing for x86-64" 2010-11-10 14:36:10 -08:00
Fritz Koenig
5f0e0617ba FDCT optimizations.
Fixed up the fdct for mmx and 8x4 sse2 to match them
most recent changes.

Change-Id: Ibee2d6c536fe14dcf75cd6eb1c73f4848a56d719
2010-11-10 14:34:02 -08:00
Scott LaVarnway
ff4a71f4c2 SSSE3 version of fast quantizer
(test clip: tulip)
For good quality mode with speed=1, this gave the encoder
a small (2 - 3%) performance boost.

Change-Id: I8a1d4269465944ac0819986c2f0be4b0a2ee0b35
2010-11-01 16:24:15 -04:00