[Assignment 1b] ImageNet Exploration

<aside> 💡 Explore ImageNet (ImageNet sample images, Kaggle ImageNet Mini 1000). What surprises you about this data set? What questions do you have? Thinking back to last week’s assignment, can you think of any ethical considerations around how this data was collected? Are there privacy considerations with the data?

</aside>

ImageNet contains so many seemingly random categories that I don't understand why its creators wanted a machine to recognize them all. Moreover, the photos themselves are sometimes far from ideal: some are blurred, and others show the subject from an unexpected angle.

When I first looked through the collection of dog photos, I was surprised that they were not as clean and polished as I had expected. Many had messy backgrounds or showed the dogs at odd angles. Some even featured a dog of a different breed or included humans in the picture. ...Wait, humans in a dog photo? And my surprise did not end there.

The optimal photo data I expected, [continued]

But most of the photos contained a background and showed the subject from a tricky angle.

Some images were in black and white.

After scrolling to the fish section, I noticed that even more of the photos had people in them. This was understandable, since a fish has to be held by someone once it is taken out of the water. However, it was unclear who these people were. I looked for a description or disclaimer on the website but couldn't find anything relevant. This finding brought me to the topic of privacy in the training of machine learning models.


When using AI models, we usually interact with the output, not the process behind it. Whatever data a model was trained on, if the result is clear and accurate enough, most of us do not care what happens behind the scenes. However, seeing real faces made me feel that all of us, users and creators alike, tend to ignore the privacy issue because the results are so beneficial. How are these models trained? I believe this is the point that needs to be clarified whenever we use features built on trained data.

Ironically, though, a limited amount of data leads to biased results. Models need more photo data to learn to be neutral (e.g., to recognize people of every ethnicity as human). To address this problem, creators become more likely to use random photos from the Internet. The solution to one ethical issue could give rise to another.

This is why labeling the source of the data matters. If each image had a valid source cited somewhere visible, it could raise awareness of the privacy issues in machine learning. The lady with the dog in the photo would be credited for her part in building the image classifier. And I, as a user, would finally be able to appreciate the man holding the fish.