This is not explicitly called out in the rules. We can piece something together from some general principles...
The general rule of thumb is that the concentration required to cast a spell is unambiguous to any attentive observer, regardless of whether there are vocal or somatic components. IMO this is really a "combat-centric" mechanic, there to avoid arguments over whether you can tell who is casting or whether you are allowed an AoO.
It would be no fun if you followed this standard rule outside of combat. So we are fudging things here.
IMO there are two relevant checks: Spellcraft, and Sense Motive vs. Bluff. I would give each observing character both checks, and I would add in generous circumstance modifiers. Note that Spellcraft is trained only. I would make the Spellcraft check base DC 10 (it would be 15+ if you were trying to identify the exact spell).
Off the top of my head, here is a laundry list of circumstance mods I would apply to the observer for both checks:
Observer has 0 ranks in Knowledge Nature, -2
Observer has 5+ ranks in Knowledge Nature, +2
Observer unfamiliar w/animal type, e.g. a city slicker seeing a bear, -2
Effect of spell not immediately obvious, -4
Caster is mostly concealed, e.g. squirrel peering through the leaves while in a tree, -4
Caster in plain sight and spell has somatic components, +2
Observer distracted, e.g. inattentive guard, -5
Caster's animal form fits into scene seamlessly, e.g. one of a flock of pigeons, -5
Spell casting time longer than one action, +4
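To show how these mods stack, here is a rough Python sketch. It is only an illustration of the house rule above, not anything official: the dictionary keys and function names are made up for the example, and it assumes the mods simply add to the observer's d20 roll plus skill bonus. For the Sense Motive vs. Bluff check, pass the caster's Bluff result as the DC.

```python
# Circumstance mods from the list above, applied to the observer's check.
# The key names are hypothetical labels invented for this sketch.
CIRCUMSTANCE_MODS = {
    "no_knowledge_nature_ranks": -2,
    "knowledge_nature_5_plus": +2,
    "unfamiliar_animal": -2,
    "effect_not_obvious": -4,
    "caster_mostly_concealed": -4,
    "plain_sight_with_somatic": +2,
    "observer_distracted": -5,
    "animal_fits_scene": -5,
    "long_casting_time": +4,
}

SPELLCRAFT_DC = 10  # base DC to notice that a spell is being cast at all


def total_modifier(circumstances):
    """Sum the mods for every circumstance that applies to this observer."""
    return sum(CIRCUMSTANCE_MODS[name] for name in circumstances)


def notices_casting(d20_roll, skill_bonus, circumstances, dc=SPELLCRAFT_DC):
    """True if roll + skill bonus + circumstance mods meets or beats the DC."""
    return d20_roll + skill_bonus + total_modifier(circumstances) >= dc


# An inattentive city guard with no Knowledge Nature ranks watching a bear.
guard = ["no_knowledge_nature_ranks", "unfamiliar_animal", "observer_distracted"]
print(total_modifier(guard))           # -9
print(notices_casting(15, 4, guard))   # True: 15 + 4 - 9 = 10
```

Since Spellcraft is trained only, an observer with no ranks would skip that check entirely and rely on Sense Motive.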
I did not include Spot checks because a hidden caster will never be noticed. This is also for the sake of simplicity. If the caster were far away, I would add in a Spot check, too, of DC 10 plus distance mods (-1 per 10'). Failure would mean that the observer is effectively distracted (see above).
Observers would get a Listen check against DC 0 (remember distance mods) to notice there is something there. Spellcasting with vocal components is in a clear, loud voice, regardless of physical form.
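The distance arithmetic for the Spot and Listen checks might be worked out like this. Again, just a sketch of the house rule: the function names are invented, and rounding partial 10' increments down is my assumption, since the above does not say how to round.

```python
SPOT_DC = 10             # only used when the caster is far away
LISTEN_DC = 0            # to notice that something is there at all
DISTRACTED_PENALTY = -5  # the "observer distracted" mod from the list


def distance_modifier(distance_feet):
    """-1 to the observer's check per full 10 feet (rounding down is assumed)."""
    return -(distance_feet // 10)


def passes(d20_roll, skill_bonus, distance_feet, dc):
    """True if roll + skill bonus + distance mod meets or beats the DC."""
    return d20_roll + skill_bonus + distance_modifier(distance_feet) >= dc


def spot_failure_penalty(d20_roll, spot_bonus, distance_feet):
    """A failed Spot check leaves the observer effectively distracted."""
    if passes(d20_roll, spot_bonus, distance_feet, SPOT_DC):
        return 0
    return DISTRACTED_PENALTY


# Observer 60 feet from the caster.
print(distance_modifier(60))              # -6
print(passes(3, 2, 60, LISTEN_DC))        # False: 3 + 2 - 6 = -1
print(spot_failure_penalty(12, 3, 60))    # -5: 12 + 3 - 6 = 9, under DC 10
```

The penalty returned by the Spot step would then be added to the observer's Spellcraft and Sense Motive checks like any other circumstance mod.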