Categories
Misc

Training AffectNet with Cross-validation CNN (Tensorflow)

submitted by /u/blevlabs
[visit reddit] [comments]

Categories
Misc

tf.compat.v1.layers.batch_normalization vs tf.contrib.layers.batch_norm

Hi all,

I have TF1.x code that uses the tf.contrib.layers.batch_norm layer, and I’m not sure how to replace it. Is tf.compat.v1.layers.batch_normalization the right substitute? Certain arguments, such as data_format and scope, are no longer present, so I’m not sure it is the correct replacement.
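
For what it’s worth, tf.compat.v1.layers.batch_normalization does exist (it takes axis, training, and name rather than data_format and scope), but TensorFlow’s migration guidance generally points to tf.keras.layers.BatchNormalization as the long-term replacement. A minimal sketch, assuming the original call used data_format='NCHW' and a scope name:

    import tensorflow as tf

    # Assumed form of the TF1 call being replaced, based on the arguments mentioned:
    #   net = tf.contrib.layers.batch_norm(net, data_format='NCHW', scope='bn1',
    #                                      is_training=is_training)

    inputs = tf.random.normal([8, 16, 32, 32])    # NCHW: batch, channels, H, W

    # Keras replacement: data_format='NCHW' maps to axis=1 (the channel axis),
    # scope roughly maps to the layer name, and is_training becomes training=.
    bn = tf.keras.layers.BatchNormalization(axis=1, name='bn1')
    outputs = bn(inputs, training=True)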

submitted by /u/dxjustice
[visit reddit] [comments]

Categories
Misc

What is an appropriate project to help learn TensorFlow? (Python)

I’ve been meaning to learn how to write machine learning programs in Python. As keen as I am, I haven’t found an easy project to get me started.

Any suggestions?

submitted by /u/DrHooBs
[visit reddit] [comments]

Categories
Misc

BatchNormalization Layer is causing ValueError: tf.function only supports singleton tf.Variables created on the first call

I’m training a deep and wide model whose convolutional side is built from inception blocks. I needed to add some BatchNormalization layers to stop exploding gradients, and now I get a ValueError pointing to the BatchNormalization layer creating multiple variables. I can’t find anyone else with this problem, so I don’t know what is causing it. I found that if I run in eager mode, the error doesn’t come up during training, but it then prevents me from saving my model. Any ideas on what is causing this?
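
Since no code is posted, the cause below is only an assumption, but the most common trigger for that exact ValueError is a layer (and therefore new tf.Variables) being created inside a function that tf.function retraces, rather than once up front. A minimal sketch of the problematic pattern and the usual fix:

    import tensorflow as tf

    # Assumed cause: building the BatchNormalization layer inside the traced call
    # means new tf.Variables get created on every call, which tf.function rejects.
    class InceptionBlockBad(tf.keras.layers.Layer):
        def call(self, x, training=False):
            bn = tf.keras.layers.BatchNormalization()   # fresh variables each call
            return bn(x, training=training)

    # Fix: create the layer once (in __init__ or build) and only apply it in call.
    class InceptionBlockGood(tf.keras.layers.Layer):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.bn = tf.keras.layers.BatchNormalization()

        def call(self, x, training=False):
            return self.bn(x, training=training)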

submitted by /u/SaveShark
[visit reddit] [comments]

Categories
Misc

Can someone suggest a good TensorFlow2 tutorial like this one?

Hi,

I’m a student trying to learn TensorFlow 2 by myself. I found this PDF document, nearly 100 pages, which teaches TensorFlow 1 really well. Can someone suggest something like this for TensorFlow 2? Please do not suggest video tutorials.

Thanks for any help in advance.

submitted by /u/Dgreenfox
[visit reddit] [comments]

Categories
Misc

Is a labels file critical for a TensorFlow model to function?

Preface – my 7-year-old daughter wants to set up a camera that tells her when birds land at her bird feeders, so I’m trying to help her (and take the opportunity to expose her to software/code and, apparently, ML)… I am in no way a software engineer or coder, so please excuse any complete ignorance…

I’m trying to implement this model (with Python 3.10). I’ve found several tutorials that seem straightforward enough, BUT when I download and unpack the project (pulling it off of TF Hub was causing issues, so I figured this would be a simpler starting point), it doesn’t have a labels file, which all the info I’ve found seems to require… I did find an Excel doc linked in the description, but it is just two columns (id and name). Do I need to load this as the labels file, and how should I do that (a dict?)? I’m assuming the TF model will output an array of IDs with the probability each is correct, and that I can then convert those IDs to a “name” however I want? Or is the labels file critical to the model functioning?
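
Generally (and this is generic classifier behavior, not something specific to that exact model), a labels file is not needed for the model to run: the model outputs an array of per-class scores, and the id/name sheet is only used afterwards to turn the winning indices into readable names. A minimal sketch under that assumption; the CSV header names and the 0-based alignment between output indices and the sheet’s ids are guesses:

    import csv

    import numpy as np

    # Load the two-column id/name sheet (exported as CSV) into a dict.
    # The header names "id" and "name" are assumptions based on the post.
    labels = {}
    with open("labels.csv", newline="") as f:
        for row in csv.DictReader(f):
            labels[int(row["id"])] = row["name"]

    # Stand-in for the model's output: most image classifiers return one score
    # per class, e.g. scores = model(image_batch)[0].numpy() for a SavedModel.
    scores = np.random.rand(len(labels))

    # Map the top-scoring class indices to names. Whether the sheet's ids line
    # up with the model's 0-based output indices is worth double-checking.
    for class_id in np.argsort(scores)[::-1][:3]:
        print(labels.get(int(class_id), "unknown"), float(scores[class_id]))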

Thanks in advance for any help.

submitted by /u/StrongAbbreviations5
[visit reddit] [comments]

Categories
Misc

Preparing Time Series Data for LSTMs

I have no formal education here, but my understanding is that RNNs take an input window and “unfold” it, basing each prediction in part on the ones before it. Say I have a batch size of 1: there shouldn’t be a relationship between the first batch and the second, correct? (If not, tell me; the rest is irrelevant.)

Does it follow from my understanding that it’s safe to

  • Have overlapping windows in my data? (so, conceptually, batches 0, 1, 2 = data[0:4], data[1:5], data[2:6]; see the sketch after this list)
  • Split into fit/val sets derived from random choices of windows? (rather than just slicing twice)
  • Shuffle data after windowing?
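
A minimal numpy sketch of the three bullets above, assuming each window is a self-contained sample for a stateless LSTM (the toy series, window length, and split fraction are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)

    data = np.arange(20, dtype=np.float32)   # toy series
    window = 4

    # Overlapping windows: data[0:4], data[1:5], data[2:6], ...
    windows = np.stack([data[i:i + window] for i in range(len(data) - window + 1)])

    # Split into fit/val sets by picking window indices at random
    # (rather than slicing the series twice).
    idx = rng.permutation(len(windows))
    split = int(0.8 * len(windows))
    train, val = windows[idx[:split]], windows[idx[split:]]

    # Shuffle after windowing: fine for a stateless LSTM, where each window is a
    # self-contained sample and order only matters *within* a window.
    rng.shuffle(train)

For what it’s worth, tf.keras.utils.timeseries_dataset_from_array (tf.keras.preprocessing.timeseries_dataset_from_array in older releases) builds the same kind of overlapping, optionally shuffled windows via its sequence_stride and shuffle arguments. One caveat: with overlapping windows, a random fit/val split means the two sets share underlying time steps, which is worth keeping in mind when reading validation metrics.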

submitted by /u/EX3000
[visit reddit] [comments]

Categories
Misc

How did the Keras syntax change over time?

Hello,

I recently stumbled upon a video tutorial with code examples. I tried to figure out what happens in the code and did some research in the Keras developer guides. I don’t have much experience with coding, but I think the syntax used in my code example is completely different from the syntax in the official Keras docs…

My question is whether the way Python code is written with Keras has changed significantly over the last few years. And if the answer is yes, does it still make sense to work with the older syntax?
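
One common source of the mismatch (an assumption here, since the tutorial isn’t linked) is that older material imports the standalone keras package while the current docs use the tf.keras namespace bundled with TensorFlow; the model-building calls themselves have stayed largely the same. A small sketch of the two styles:

    # Older tutorials often use the standalone Keras package:
    #   from keras.models import Sequential
    #   from keras.layers import Dense
    #   model = Sequential([Dense(32, activation="relu", input_dim=10), Dense(1)])

    # Current TensorFlow docs use the tf.keras namespace bundled with TensorFlow:
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.summary()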

submitted by /u/LeiseLeo
[visit reddit] [comments]

Categories
Offsites

TRILLsson: Small, Universal Speech Representations for Paralinguistic Tasks

In recent years, we have seen dramatic improvements on lexical tasks such as automatic speech recognition (ASR). However, machine systems still struggle to understand paralinguistic aspects — such as tone, emotion, whether a speaker is wearing a mask, etc. Understanding these aspects represents one of the remaining difficult problems in machine hearing. In addition, state-of-the-art results often come from ultra-large models trained on private data, making them impractical to run on mobile devices or to release publicly.

In “Universal Paralinguistic Speech Representations Using Self-Supervised Conformers”, to appear in ICASSP 2022, we introduce CAP12, the 12th layer of a 600M parameter model trained on the YT-U training dataset using self-supervision. We demonstrate that the CAP12 model outperforms nearly all previous results in our paralinguistic benchmark, sometimes by large margins, even though previous results are often task-specific. In “TRILLsson: Distilled Universal Paralinguistic Speech Representations”, we introduce the small, performant, publicly-available TRILLsson models and demonstrate how we reduced the size of the high-performing CAP12 model by 6x-100x while maintaining 90-96% of the performance. To create TRILLsson, we apply knowledge distillation on appropriately-sized audio chunks and use different architecture types to train smaller, faster networks that are small enough to run on mobile devices.
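
As a rough illustration of using the released models, loading one from TF Hub might look like the sketch below; the exact hub handle, input format, and output key are assumptions, so check the published model pages rather than relying on these literals.

    import tensorflow as tf
    import tensorflow_hub as hub

    # The hub handle and the "embedding" output key below are assumptions; check
    # the published TRILLsson model pages on TF Hub for the exact names.
    model = hub.load("https://tfhub.dev/google/trillsson2/1")

    # Assumed input: 16 kHz mono audio, float32, shape [batch, samples].
    audio = tf.random.uniform([1, 32000], minval=-1.0, maxval=1.0)
    embedding = model(audio)["embedding"]
    print(embedding.shape)   # one fixed-size vector per clip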

1M-Hour Dataset to Train Ultra-Large Self-Supervised Models
We leverage the YT-U training dataset to train the ultra-large, self-supervised CAP12 model. The YT-U dataset is a highly varied, 900k+ hour dataset that contains audio covering a wide range of topics, background conditions, and speaker acoustic properties.

Video categories by length (outer) and number (inner), demonstrating the variety in the YT-U dataset (figure from BigSSL)

We then modify a Wav2Vec 2.0 self-supervised training paradigm, which can solve tasks using raw data without labels, and combine it with ultra-large Conformer models. Because self-training doesn’t require labels, we can take full advantage of YT-U by scaling up our models to some of the largest model sizes ever trained, including 600M, 1B, and 8B parameters.

NOSS: A Benchmark for Paralinguistic Tasks
We demonstrate that an intermediate representation of one of the previous models contains a state-of-the-art representation for paralinguistic speech. We call the 600M parameter Conformer model without relative attention Conformer Applied to Paralinguistics (CAP). We exhaustively search through all intermediate representations of six ultra-large models and find that layer 12 (CAP12) outperforms previous representations by significant margins.

To measure the quality of the roughly 300 candidate paralinguistic speech representations, we evaluate on an expanded version of the NOn-Semantic Speech (NOSS) benchmark, which is a collection of well-studied paralinguistic speech tasks, such as speech emotion recognition, language identification, and speaker identification. These tasks focus on paralinguistic aspects of speech, which require evaluating speech features on the order of 1 second or longer, rather than lexical features, which require 100 ms or shorter. We then add to the benchmark a mask-wearing task introduced at Interspeech 2020, a fake speech detection task (ASVSpoof 2019), a task to detect the level of dysarthria from Project Euphonia, and an additional speech emotion recognition task (IEMOCAP). By expanding the benchmark and increasing the diversity of the tasks, we empirically demonstrate that CAP12 is even more generally useful than previous representations.

Simple linear models on time-averaged CAP12 representations even outperform complex, task-specific models on five out of eight paralinguistic tasks. This is surprising because comparable models sometimes use additional modalities (e.g., vision and speech, or text and speech) as well. Furthermore, CAP12 is exceptionally good at emotion recognition tasks. CAP12 embeddings also outperform all other embeddings on the remaining tasks, with a single exception: on the dysarthria detection task, one embedding from a supervised network does better.

Model                 Voxceleb∗   Voxforge   Speech Commands   ASVSpoof2019∗∗   Euphonia#   CREMA-D   IEMOCAP
Prev SoTA             –           95.4       97.9              5.11             45.9        74.0      67.6+
TRILL                 12.6        84.5       77.6              74.6             48.1        65.7      54.3
ASR Embedding         5.2         98.9       96.1              11.2             54.5        71.8      65.4
Wav2Vec2 layer 6††    17.9        98.5       95.0              6.7              48.2        77.4      65.8
CAP12                 51.0        99.7       97.0              2.5              51.5        88.2      75.0

Test performance on the NOSS Benchmark and extended tasks. “Prev SoTA” indicates the previous best performing state-of-the-art model, which has arbitrary complexity, while all other rows are linear models on time-averaged input.
∗ Filtered according to YouTube’s privacy guidelines.
∗∗ Uses equal error rate [20].
# The only non-public dataset; we exclude it from aggregate scores.
Audio and visual features were used in previous state-of-the-art models.
+ The previous state-of-the-art model performed cross-validation; for our evaluation, we hold out two specific speakers as a test set.
†† Wav2Vec 2.0 model from HuggingFace; best overall layer was layer 6.
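
As a concrete illustration of the evaluation setup behind the non-“Prev SoTA” rows above (a simple linear model on time-averaged representations), here is a small sketch with stand-in data; the embedding dimensionality, number of classes, and use of scikit-learn are assumptions rather than the paper’s exact configuration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Stand-ins for per-utterance CAP12 embeddings: [num_frames, dim] per clip,
    # plus one class label per clip (shapes and labels here are illustrative only).
    clips = [rng.normal(size=(rng.integers(50, 150), 1024)) for _ in range(200)]
    labels = rng.integers(0, 4, size=len(clips))

    # Time-average each clip to a single fixed-size vector, then fit a linear probe.
    features = np.stack([clip.mean(axis=0) for clip in clips])
    clf = LogisticRegression(max_iter=1000).fit(features[:150], labels[:150])
    print("held-out accuracy:", clf.score(features[150:], labels[150:]))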

TRILLsson: Small, High Quality, Publicly Available Models
Similar to FRILL, our next step was to make an on-device, publicly available version of CAP12. This involved using knowledge distillation to train smaller, faster, mobile-friendly architectures. We experimented with EfficientNet, Audio Spectrogram Transformer (AST), and ResNet. These model types are very different, and cover both fixed-length and arbitrary-length inputs. EfficientNet comes from a neural architecture search over vision models to find simultaneously performant and efficient model structures. AST models are transformers adapted to audio inputs. ResNet is a standard architecture that has shown good performance across many different models.

We trained models that performed on average 90-96% as well as CAP12, despite being 1%-15% the size and trained using only 6% the data. Interestingly, we found that different architecture types performed better at different sizes. ResNet models performed best at the low end, EfficientNet in the middle, and AST models at the larger end.

Aggregate embedding performance vs. model size for various student model architectures and sizes. We demonstrate that ResNet architectures perform best for small sizes, EfficientNetV2 performs best in the midsize model range, up to the largest model size tested, after which the larger AST models are best.

We perform knowledge distillation with the goal of matching a student, with a fixed-size input, to the output of a teacher, with a variable-size input, for which there are two methods of generating student targets: global matching and local matching. Global matching produces distillation targets by generating CAP12 embeddings for an entire audio clip, and then requires that a student match the target from just a small segment of audio (e.g., 2 seconds). Local matching requires that the student network match the average CAP12 embedding just over the smaller portion of the audio that the student sees. In our work, we focused on local matching.

Two types of generating distillation targets for sequences. Left: Global matching uses the average CAP12 embedding over the whole clip for the target for each local chunk. Right: Local matching uses CAP12 embeddings averaged just over local clips as the distillation target.
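
A rough sketch of the local-matching objective described above: the student sees a fixed-length chunk and regresses the teacher’s (CAP12) embedding averaged over that same chunk. The placeholder student architecture, the 1024-dimensional embedding size, and the MSE loss are assumptions for illustration, not the paper’s exact setup.

    import tensorflow as tf

    # Placeholder student that maps a fixed-length audio chunk to one embedding;
    # the real TRILLsson students are EfficientNet / AST / ResNet variants, and
    # the 1024-dim teacher embedding size here is assumed.
    student = tf.keras.Sequential([
        tf.keras.layers.Reshape((-1, 1)),
        tf.keras.layers.Conv1D(64, 80, strides=4, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1024),
    ])
    optimizer = tf.keras.optimizers.Adam(1e-4)

    @tf.function
    def local_matching_step(chunk, teacher_frames):
        # Local matching: the target is the teacher's (CAP12) embedding averaged
        # over just this chunk; teacher_frames has shape [batch, frames, dim].
        target = tf.reduce_mean(teacher_frames, axis=1)
        with tf.GradientTape() as tape:
            pred = student(chunk, training=True)             # chunk: [batch, samples]
            loss = tf.reduce_mean(tf.square(pred - target))  # MSE distillation loss
        grads = tape.gradient(loss, student.trainable_variables)
        optimizer.apply_gradients(zip(grads, student.trainable_variables))
        return loss

    # Example call with random stand-in data (2-second chunks at 16 kHz).
    print(local_matching_step(tf.random.normal([4, 32000]),
                              tf.random.normal([4, 10, 1024])))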

Observation of Bimodality and Future Directions
Paralinguistic information shows an unexpected bimodal distribution. For the CAP model that operates on 500 ms input segments, and two of the full-input Conformer models, intermediate representations gradually increase in paralinguistic information, then decrease, then increase again, and finally lose this information towards the output layer. Surprisingly, this pattern is also seen when exploring the intermediate representations of networks trained on retinal images.

500 ms inputs to CAP show a relatively pronounced bimodal distribution of paralinguistic information across layers.
Two of the conformer models with full inputs show a bimodal distribution of paralinguistic information across layers.

We hope that smaller, faster models for paralinguistic speech unlock new applications in speech recognition, text-to-speech generation, and understanding user intent. We also expect that smaller models will be more easily interpretable, which will allow researchers to understand what aspects of speech are important for paralinguistics. Finally, we hope that our open-sourced speech representations are used by the community to improve paralinguistic speech tasks and user understanding in private or small datasets.

Acknowledgements
I’d like to thank my co-authors Aren Jansen, Wei Han, Daniel Park, Yu Zhang, and Subhashini Venugopalan for their hard work and creativity on this project. I’d also like to thank the members of the large collaboration for the BigSSL work, without which these projects would not be possible. The team includes James Qin, Anmol Gulati, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu.

Categories
Misc

Annoying behavior of developers not keeping backward compatibility

Hi, I’m a new student trying to learn TensorFlow. I have already found some really good books and tutorials that help me learn fast. But when I started trying out the examples, I soon realized that all the good tutorials I have cover TensorFlow 1.15, and amazingly that code will not work with TensorFlow 2.0+.

I find this really cool and amazing behavior from the developers, who have zero concern for backward compatibility. I can go to Google and fix the old code line by line, replacing each line with its TensorFlow 2 equivalent.
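
For reference, TensorFlow does ship a documented stopgap for running most TF1-style tutorial code on a TF2 install, the tf.compat.v1 module; a minimal sketch:

    # Documented stopgap for running most unmodified TF1-style tutorial code on a
    # TF2 install, before migrating it line by line:
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # Old-style graph/session code then works largely as the TF1 tutorials show.
    x = tf.placeholder(tf.float32, shape=[None, 3])
    y = tf.reduce_sum(x, axis=1)
    with tf.Session() as sess:
        print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))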

But since I’m a beginner, this is a nightmare for me. Can anyone explain to me in simple terms why these douchebags do not maintain backward compatibility when these airheads update these libraries?

I really want to find the developers who did this and dip their face in boiling oil.

P.S.: when is TensorFlow 3 coming out? I’m now trying to learn TensorFlow 2; I assume that TensorFlow 3 will be completely different from TensorFlow 2 and we’ll have to re-learn everything from scratch for that too.

submitted by /u/Dgreenfox
[visit reddit] [comments]